{"title": "Learning Adaptive Value of Information for Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 953, "page_last": 961, "abstract": "Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to fine-tune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting non-homogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is simultaneously 4$\\times$ faster while using only a small fraction of possible features, with similar results on an OCR task.", "full_text": "Learning Adaptive Value of Information\n\nfor Structured Prediction\n\nDavid Weiss\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA\n\ndjweiss@cis.upenn.edu\n\ntaskar@cs.washington.edu\n\nUniversity of Washington\n\nBen Taskar\n\nSeattle, WA\n\nAbstract\n\nDiscriminative methods for learning structured models have enabled wide-spread\nuse of very rich feature representations. 
However, the computational cost of fea-\nture extraction is prohibitive for large-scale or time-sensitive applications, often\ndominating the cost of inference in the models. Signi\ufb01cant efforts have been de-\nvoted to sparsity-based model selection to decrease this cost. Such feature se-\nlection methods control computation statically and miss the opportunity to \ufb01ne-\ntune feature extraction to each input at run-time. We address the key challenge\nof learning to control \ufb01ne-grained feature extraction adaptively, exploiting non-\nhomogeneity of the data. We propose an architecture that uses a rich feedback\nloop between extraction and prediction. The run-time control policy is learned us-\ning ef\ufb01cient value-function approximation, which adaptively determines the value\nof information of features at the level of individual variables for each input. We\ndemonstrate signi\ufb01cant speedups over state-of-the-art methods on two challeng-\ning datasets. For articulated pose estimation in video, we achieve a more accurate\nstate-of-the-art model that is also faster, with similar results on an OCR task.\n\n1\n\nIntroduction\n\nEffective models in complex computer vision and natural language problems try to strike a favorable\nbalance between accuracy and speed of prediction. One source of computational cost is inference in\nthe model, which can be addressed with a variety of approximate inference methods. However, in\nmany applications, computing the scores of the constituent parts of the structured model\u2013i.e. feature\ncomputation\u2013is the primary bottleneck. 
For example, when tracking articulated objects in video,\noptical \ufb02ow is a very informative feature that often requires many seconds of computation time per\nframe, whereas inference for an entire sequence typically requires only fractions of a second [16];\nin natural language parsing, feature computation may take up to 80% of the computation time [7].\nIn this work, we show that large gains in the speed/accuracy trade-off can be obtained by departing\nfrom the traditional method of \u201cone-size-\ufb01ts-all\u201d model and feature selection, in which a static set\nof features are computed for all inputs uniformly. Instead, we employ an adaptive approach: the\nparts of the structured model are constructed speci\ufb01cally at test-time for each particular instance, for\nexample, at the level of individual video frames. There are several key distinctions of our approach:\n\u2022 No generative model. One approach is to assume a joint probabilistic model of the input\nand output variables and a utility function measuring payoffs. The expected value of infor-\nmation measures the increase in expected utility after observing a given variable [12, 8].\nUnfortunately, the problem of computing optimal conditional observation plans is compu-\ntationally intractable even for simple graphical models like Naive Bayes [9]. Moreover,\njoint models of input and output are typically quite inferior in accuracy to discriminative\nmodels of output given input [10, 3, 19, 1].\n\n1\n\n\f\u2022 Richly parametrized, conditional value function. The central component of our method\nis an approximate value function that utilizes a set of meta-features to estimate future\nchanges in value of information given a predictive model and a proposed feature set as in-\nput. 
The critical advantage here is that the meta-features can incorporate valuable properties\nbeyond con\ufb01dence scores from the predictive model, such as long-range input-dependent\ncues that convey information about the self-consistency of a proposed output.\n\u2022 Non-myopic reinforcement learning. We frame the control problem in terms of \ufb01nd-\ning a feature extraction policy that sequentially adds features to the models until a budget\nlimit is reached, and we show how to learn approximate policies that result in accurate\nstructured models that are dramatically more ef\ufb01cient. Speci\ufb01cally, we learn to weigh the\nmeta-features for the value function using linear function approximation techniques from\nreinforcement learning, where we utilize a deterministic model that can be approximately\nsolved with a simple and effective sampling scheme.\n\nIn summary, we provide a discriminative, practical architecture that solves the value of information\nproblem for structured prediction problems. We \ufb01rst learn a prediction model that is trained to use\nsubsets of features computed sparsely across the structure of the input. These feature combinations\nfactorize over the graph structure, and we allow for sparsely computed features such that different\nvertices and edges may utilize different features of the input. We then use reinforcement learning to\nestimate a value function that adaptively computes an approximately optimal set of features given a\nbudget constraint. Because of the particular structure of our problem, we can apply value function\nestimation in a batch setting using standard least-squares solvers. 
Finally, we apply our method to\ntwo sequential prediction domains: articulated human pose estimation and handwriting recognition.\nIn both domains, we achieve more accurate prediction models that utilize less features than the\ntraditional monolithic approach.\n\n2 Related Work\n\nThere is a signi\ufb01cant amount of prior work on the issue of controlling test-time complexity. How-\never, much of this work has focused on the issue of feature extraction for standard classi\ufb01cation\nproblems, e.g. through cascades or ensembles of classi\ufb01ers that use different subsets of features at\ndifferent stages of processing. More recently, feature computation cost has been explicitly incorpo-\nrated speci\ufb01cally into the learning procedure (e.g., [6, 14, 2, 5].) The most related recent work of this\ntype is [20], who de\ufb01ne a reward function for multi-class classi\ufb01cation with a series of increasingly\ncomplex models, or [6], who de\ufb01ne a feature acquisition model similar in spirit to ours, but with\na different reward function and modeling a variable trade-off rather than a \ufb01xed budget. We also\nnote that [4] propose explicitly modeling the value of evaluating a classi\ufb01er, but their approach uses\nensembles of pre-trained models (rather than the adaptive model we propose). And while the goals\nof these works are similar to ours\u2013explicitly controlling feature computation at test time\u2013none of the\nclassi\ufb01er cascade literature addresses the structured prediction nor the batch setting.\nMost work that addresses learning the accuracy/ef\ufb01ciency trade-off in a structured setting applies\nprimarily to inference, not feature extraction. E.g., [23] extend the idea of a classi\ufb01er cascade to\nthe structured prediction setting, with the objective de\ufb01ned in terms of obtaining accurate inference\nin models with large state spaces after coarse-to-\ufb01ne pruning. 
More similar to this work, [7] incrementally prune the edge space of a parsing model using a meta-feature-based classifier, reducing the total number of features that need to be extracted. However, both of these prior efforts rely entirely on the marginal scores of the predictive model in order to make their pruning decisions, and do not allow future feature computations to rectify past mistakes, as our work does.
Most related is the prior work of [22], in which one of an ensemble of structured models is selected on a per-example basis. This idea is essentially a coarse sub-case of the framework presented in this work, without the adaptive predictive model that allows for composite features that vary across the input, without any reinforcement learning to model the future value of taking a decision (which is critical to the success of our method), and without the local inference method proposed in Section 4. In our experiments (Section 5), the "Greedy (Example)" baseline is representative of the limitations of this earlier approach.

Algorithm 1: Inference for x and budget B.
  define an action a as a pair ⟨α ∈ G, t ∈ {1, . . . , T}⟩;
  initialize B′ ← 0, z ← 0, y ← h(x, z);
  initialize action space (first tier) A = {(α, 1) | α ∈ G};
  while B′ < B and |A| > 0 do
    a ← argmax_{a ∈ A} β⊤φ(x, z, a);
    A ← A \ a;
    if c_a ≤ (B − B′) then
      z ← z + a;  B′ ← B′ + c_a;  y ← h(x, z);
      A ← A ∪ (α, t + 1);
    end
  end

Figure 1: Overview of our approach. (Left) A high-level summary of the processing pipeline: as in standard structured prediction, features are extracted and inference is run to produce an output. 
However, information may optionally feed back in the form of extracted meta-features that are used by a control policy to determine another set of features to be extracted. Note that we use stochastic subgradient descent to learn the inference model w first and reinforcement learning to learn the control model β given w. (Right) Detailed algorithm for factor-wise inference for an example x given a graph structure G and budget B. The policy repeatedly selects the highest-valued action from an action space A that represents extracting features for each constituent part of the graph structure G.

3 Learning Adaptive Value of Information for Structured Prediction

Setup. We consider the setting of structured prediction, in which our goal is to learn a hypothesis mapping inputs x ∈ X to outputs y ∈ Y(x), where |x| = L and y is an L-vector of K-valued variables, i.e. Y(x) = Y_1 × ··· × Y_L and each Y_i = {1, . . . , K}. We follow the standard max-margin structured learning approach [18] and consider linear predictive models of the form w⊤f(x, y). However, we introduce an additional explicit feature extraction state vector z:

    h(x, z) = argmax_{y ∈ Y(x)} w⊤f(x, y, z).    (1)

Above, f(x, y, z) is a sparse vector of D features that takes time c⊤z to compute for a non-negative cost vector c and a binary indicator vector z of length |z| = F. Intuitively, z indicates which of F sets of features are extracted when computing f; z = 1 means every possible feature is extracted, while z = 0 means that only a minimal set of features is extracted.
Note that by incorporating z into the feature function, the predictor h can learn to use different linear weights for the same underlying feature value by conditioning the feature on the value of z. As we discuss in Section 5, adapting the weights in this way is crucial to building a predictor h that works well for any subset of features. 
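To make the role of z concrete, here is a minimal Python sketch of the z-gated score and predictor of equation (1). The per-block weight layout, the callable feature blocks, and the brute-force argmax are our own illustrative choices; the paper's models use max-sum message passing on low tree-width graphs instead.

```python
import itertools
import numpy as np

def gated_score(w, feature_blocks, z, x, y):
    """Score w . f(x, y, z): block a contributes only when z[a] == 1.
    Here w is split into per-block weight vectors (an illustrative layout)."""
    score = 0.0
    for a, f_a in enumerate(feature_blocks):
        if z[a]:
            score += w[a] @ f_a(x, y)
    return score

def h(w, feature_blocks, z, x, label_sets):
    """Eq. (1): argmax over Y(x), done by brute force for tiny toy outputs."""
    best_y, best_s = None, -np.inf
    for y in itertools.product(*label_sets):
        s = gated_score(w, feature_blocks, z, x, y)
        if s > best_s:
            best_y, best_s = y, s
    return best_y
```

Because the score sums only the gated blocks, the same h can be queried with any extraction state z, which is what the control policy exploits.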
We will discuss how to construct such features in more detail in Section 4.
Suppose we have learned such a model h. At test time, our goal is to make the most accurate predictions possible for an example under a fixed budget B. Specifically, given h and a loss function ℓ : Y × Y ↦ R+, we wish to find the following:

    H(x, B) = argmin_z E_{y|x}[ℓ(y, h(x, z))]    (2)

In practice, there are three primary difficulties in optimizing equation (2). First, the distribution P(Y|X) is unknown. Second, there are exponentially many z's to explore. Most important, however, is the fact that we do not have free access to the objective function. Instead, given x, we are optimizing over z using a function oracle, since we cannot compute f(x, y, z) without paying c⊤z, and the total cost of all the calls to the oracle must not exceed B. Our approach to solving these problems is outlined in Figure 1; we learn a control model (i.e. a policy) by posing the optimization problem as an MDP and using reinforcement learning techniques.
Adaptive extraction MDP. We model the budgeted prediction optimization as the following Markov Decision Process. The state of the MDP, s, is the tuple (x, z) for an input x and feature extraction state z (for brevity we will simply write s). The start state is s_0 = (x, 0), with x ∼ P(X), and z = 0 indicating only a minimal set of features have been extracted. The action space A(s) is {i | z_i = 0} ∪ {0}, where z_i is the i'th element of z; given a state-action pair (s, a), the next state is deterministically s′ = (x, z + e_a), where e_a is the indicator vector with a 1 in the a'th component, or the zero vector if a = 0. Thus, at each state we can choose to extract one additional set of features, or none at all (at which point the process terminates). 
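The extraction MDP just described can be sketched directly. The tuple state representation and the use of `None` for the terminating action a = 0 are our own conventions:

```python
import numpy as np

def start_state(x, F):
    """Start state s0 = (x, 0): none of the F optional feature sets extracted."""
    return (x, np.zeros(F, dtype=int))

def actions(state):
    """A(s) = {i | z_i = 0} union {0}; here None plays the role of action 0."""
    _, z = state
    return [i for i in range(len(z)) if z[i] == 0] + [None]

def step(state, a):
    """Deterministic transition s' = (x, z + e_a); a = None terminates."""
    x, z = state
    if a is None:
        return (x, z.copy()), True
    z_next = z.copy()
    z_next[a] = 1
    return (x, z_next), False
```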
Finally, for fixed h, we define the shorthand η(s) = E_{y|x}[ℓ(y, h(x, z))] to be the expected error of the predictor h given state z and input x.
We now define the expected reward to be the adaptive value of information of extracting the a'th set of features given the system state and budget B:

    R(s, a, s′) = { η(s) − η(s′)   if c⊤z(s′) ≤ B
                  { 0              otherwise             (3)

Intuitively, (3) says that each time we add additional features to the computation, we gain reward equal to the decrease in error achieved with the new features (or pay a penalty if the error increases). However, if we ever exceed the budget, then any further decrease does not count; no more reward can be gained. Furthermore, assuming f(x, y, z) can be cached appropriately, it is clear that we pay only the additional computational cost c_a for each action a, so the entire cumulative computational burden of reaching some state s is exactly c⊤z for the corresponding z vector.
Given a trajectory of states s_0, s_1, . . . , s_T computed by some deterministic policy π, it is clear that the final cumulative reward R_π(s_0) is the difference between the starting error rate and the rate of the last state satisfying the budget:

    R_π(s_0) = η(s_0) − η(s_1) + η(s_1) − ··· = η(s_0) − η(s_{t⋆}),    (4)

where t⋆ is the index of the final state within the budget constraint. Therefore, the optimal policy π⋆ that maximizes expected reward will compute the z⋆ minimizing (2) while satisfying the budget constraint.
Learning an approximate policy with long-range meta-features. In this work, we focus on a straightforward method for learning an approximate policy: a batch version of least-squares policy iteration [11] based on Q-learning [21]. 
We parametrize the policy using a linear function of meta-features φ computed from the current state s = (x, z): π_β(s) = argmax_a β⊤φ(x, z, a). The meta-features (which we abbreviate as simply φ(s, a) henceforth) need to be rich enough to represent the value of choosing to expand feature a for a given partially-computed example (x, z). Note that we have already computed f(x, h(x, z), z), which may be useful in estimating the confidence of the model on a given example. However, we have much more freedom in choosing φ(s, a) than we had in choosing f; while f is restricted to ensure that inference is tractable, we have no such restriction for φ. We therefore compute functions of h(x, z) that take into account large sets of output variables, and since we need only compute them for the particular output h(x, z), we can do so very efficiently. We describe the specific φ we use in our experiments in Section 5, typically measuring the self-consistency of the output as a surrogate for the expected accuracy.
One-step off-policy Q-learning with least-squares. To simplify the notation, we will assume that given the current state s, taking action a deterministically yields state s′. Given a policy π, the value of the policy is recursively defined as the immediate expected reward plus the discounted value of the next state:

    Q_π(s, a) = R(s, a, s′) + γ Q_π(s′, π(s′)).    (5)

The goal of Q-learning is to learn the Q for the optimal policy π⋆ with maximal Q_{π⋆}; however, it is clear that we can increase Q by simply stopping early when Q_π(s, a) < 0 (the future reward in this case is simply zero). Therefore, we define the off-policy optimized value Q⋆_π as follows:

    Q⋆_π(s_t, π(s_t)) = R(s_t, π(s_t), s_{t+1}) + γ [Q⋆_π(s_{t+1}, π(s_{t+1}))]_+ .    (6)

We propose the following one-step algorithm for learning Q from data. Suppose we have a finite trajectory s_0, . . . , s_T. Because both π and the state transitions are deterministic, we can unroll the recursion in (6) and compute Q⋆_π(s_t, π(s_t)) for each sample using simple dynamic programming. For example, if γ = 1 (there is no discount for future reward), we obtain Q⋆_π(s_i, π(s_i)) = η(s_i) − η(s_{t⋆}), where t⋆ is the optimal stopping time that satisfies the given budget.
We therefore learn parameters β⋆ for an approximate Q as follows. Given an initial policy π, we execute π for each example (x_j, y_j) to obtain trajectories s^j_0, . . . , s^j_T. We then solve the following least-squares optimization,

    β⋆ = argmin_β λ||β||² + (1/nT) Σ_{j,t} ( β⊤φ(s^j_t, π(s^j_t)) − Q⋆_π(s^j_t, π(s^j_t)) )² ,    (7)

using cross-validation to determine the regularization parameter λ.
We perform a simple form of policy iteration as follows. We first initialize β by estimating the expected reward function (this can be estimated by randomly sampling pairs (s, s′), which are more efficient to compute than Q-functions on trajectories). We then compute trajectories under π_β and use these trajectories to compute the β⋆ that approximates Q⋆_π. We found that additional iterations of policy iteration did not noticeably change the results.
Learning for multiple budgets. 
One potential drawback of the approach just described is that we must learn a different policy for every desired budget. A more attractive alternative is to learn a single policy that is tuned to a range of possible budgets. One solution is to set γ = 1 and learn with B = ∞, so that the value Q⋆_π represents the best improvement possible using some optimal budget B⋆; however, at test time, it may be that B⋆ is greater than the available budget B and Q⋆_π is an over-estimate. By choosing γ < 1, we can trade off between valuing reward for short-term gain with smaller budgets B < B⋆ and longer-term gain with the unknown optimal budget B⋆.
In fact, we can further encourage our learned policy to be useful for smaller budgets by adjusting the reward function. Note that two trajectories that start at s_0 and end at s_{t⋆} will have the same reward, yet one trajectory might maintain a much lower error rate than the other during the process and therefore be more useful for smaller budgets. We therefore add a shaping component to the expected reward in order to favor the more useful trajectory as follows:

    R_α(s, a, s′) = η(s) − η(s′) − α [η(s′) − η(s)]_+ .    (8)

This modification introduces a term that does not cancel when transitioning from one state to the next if the next state has higher error than our current state. Thus, we can only achieve the optimal reward η(s_0) − η(s_{t⋆}) when there is a sequence of feature extractions that never increases the error rate¹; if such a sequence does not exist, then the parameter α controls the trade-off between the importance of reaching s_{t⋆} and minimizing any errors along the way. Note that we can still use the procedure described above to learn β when using R_α instead of R. 
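To illustrate the learning step, the following sketch unrolls the clipped recursion (6) with the shaped reward (8) into per-state Q targets, then fits β by regularized least squares as in (7). This is a minimal numpy sketch under our own naming; the η values in the usage below are made up, and a plain ridge solver stands in for the cross-validated choice of λ.

```python
import numpy as np

def q_targets(etas, alpha=0.0, gamma=1.0):
    """Backward unroll of Q*(s_t) = R_alpha(s_t, ., s_{t+1}) + gamma*[Q*(s_{t+1})]_+,
    given the expected errors eta(s_0), ..., eta(s_T) along one trajectory."""
    T = len(etas) - 1
    q = np.zeros(T)
    future = 0.0
    for t in reversed(range(T)):
        r = etas[t] - etas[t + 1] - alpha * max(etas[t + 1] - etas[t], 0.0)
        q[t] = r + gamma * max(future, 0.0)   # stopping early floors future value at 0
        future = q[t]
    return q

def fit_beta(phis, targets, lam=1e-3):
    """Ridge regression for eq. (7): minimize lam*||b||^2 + mean squared error."""
    X, y = np.asarray(phis), np.asarray(targets)
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```

With γ = 1 and α = 0, the target at the start state reduces to η(s_0) − η(s_{t⋆}), matching the dynamic-programming shortcut described above.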
We use a development set to tune α as well as γ to find the most useful policy when sweeping B across a range of budgets.
Batch mode inference. At test time, we are typically given a test set of m examples rather than a single example. In this setting the budget applies to the entire inference process, and it may be useful to spend more of the budget on difficult examples rather than allocate the budget evenly across all examples. In this case, we extend our framework to concatenate the states of all m examples, s = (x_1, . . . , x_m, z_1, . . . , z_m). The action consists of choosing an example and then choosing an action within that example's sub-state; our policy searches over the space of all actions for all examples simultaneously. Because of this, we impose additional constraints on the action space, specifically:

    z(a, . . . ) = 1 =⇒ z(a′, . . . ) = 1,   ∀a′ < a.    (9)

Equation (9) states that there is an inherent ordering of feature extractions, such that we cannot compute the a'th feature set without first computing feature sets 1, . . . , a − 1. This greatly simplifies the search space in the batch setting while at the same time preserving enough flexibility to yield significant improvements in efficiency.
Baselines. We compare to two baselines: a simple entropy-based approach, and a more complex imitation learning scheme (inspired by [7]) in which we learn a classifier to reproduce a target policy given by an oracle. The entropy-based approach simply computes probabilistic marginals and extracts features for whichever portion of the output space has the highest entropy in the predicted distribution. 
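A minimal sketch of this entropy-based baseline for a chain of positions; the function names and the row-per-position marginal representation are our own:

```python
import numpy as np

def marginal_entropies(marginals):
    """Per-position entropy of predicted marginals (each row is a
    distribution over the K labels at that position)."""
    p = np.clip(np.asarray(marginals), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def entropy_baseline_pick(marginals, already_extracted):
    """Pick the most uncertain position that still has features left to extract."""
    h = marginal_entropies(marginals)
    h[np.asarray(already_extracted, dtype=bool)] = -np.inf
    return int(np.argmax(h))
```

Unlike the learned policy, this rule looks only at the model's own confidence, which is exactly the limitation the meta-feature value function is designed to overcome.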
For the imitation learning model, we use the same trajectories used to learn Q⋆_π, but instead we create a classification dataset of positive and negative examples given a budget B, by assigning all state/action pairs along a trajectory within the budget as positive examples and all budget violations as negative examples. We tune the budget B using a development set to optimize the overall trade-off when the policy is evaluated with multiple budgets.

¹While adding features decreases training error on average, even on the training set additional features may lead to increased error for any particular example.

Feature Tier (T) | Error (%) | Time: Fixed | Time: Entropy | Time: Q-Learn
       4         |   44.07   |   16.20s    |    16.20s     |     8.91s
       3         |   46.17   |   12.00s    |     8.10s     |     5.51s
       2         |   46.98   |    5.50s    |     6.80s     |     4.86s
       1         |   51.49   |    2.75s    |      —        |      —
     Best        |   43.45   |     —       |      —        |    13.45s

Table 1: Trade-off between average elbow and wrist error rate and total runtime achieved by our method on the pose dataset; each row fixes an error rate and determines the amount of time required by each method to achieve that error. Unlike using entropy-based confidence scores, our Q-learning approach always improves runtime over a priori selection and even yields a faster model that is also more accurate (final row).

4 Design of the information-adaptive predictor h

Learning. We now address the problem of learning h(x, z) from n labeled data points {(x_j, y_j)}^n_{j=1}. Since we do not necessarily know the test-time budget during training (nor would we want to repeat the training process for every possible budget), we formulate the problem as minimizing the expected training loss according to a uniform distribution over budgets:

    w⋆ = argmin_w λ||w||² + (1/n) Σ_{j=1}^n E_z[ℓ(y_j, h(x_j, z))].    (10)

Note that if ℓ is convex, then (10) is a weighted sum of convex functions and is also convex. 
Our choice of distribution for z will determine how the predictor h is calibrated. In our experiments, we sampled z's uniformly at random. To learn w, we use Pegasos-style [17] stochastic sub-gradient descent; we approximate the expectation in (10) by resampling z every time we pick up a new example (x_j, y_j). We set λ and a stopping-time criterion through cross-validation on a development set.
Feature design. We now turn to the question of designing f(x, y, z). In the standard pairwise graphical model setting (before considering z), we decompose a feature function f(x, y) into unary and pairwise features. We consider several different schemes of incorporating z, of varying complexity. The simplest scheme is to use several different feature functions f^1, . . . , f^F. Then |z| = F, and z_a = 1 indicates that f^a is computed. Thus, we have the following expression, where we use z(a) to indicate the a'th element of z:

    f(x, y, z) = Σ_{a=1}^F z(a) [ Σ_{i∈V} f^a_u(x, y_i) + Σ_{(i,j)∈E} f^a_e(x, y_i, y_j) ]    (11)

Note that in practice we can choose each f^a to be a sparse vector such that f^a · f^{a′} = 0 for all a′ ≠ a; that is, each feature function f^a "fills out" a complementary section of the feature vector f.
A much more powerful approach is to create a feature vector as the composite of different extracted features for each vertex and edge in the model. In this setting, we set z = [z_u z_e], where |z| = (|V| + |E|)F, and we have

    f(x, y, z) = Σ_{i∈V} Σ_{a=1}^F z_u(a, i) f^a_u(x, y_i) + Σ_{(i,j)∈E} Σ_{a=1}^F z_e(a, ij) f^a_e(x, y_i, y_j).    (12)

We refer to this latter feature extraction method as factor-level feature extraction, and the former as example-level.²
Reducing inference overhead. 
Feature computation time is only one component of the computational cost in making predictions; computing the argmax (1) given f can also be expensive.

²The restriction (9) also allows us to increase the complexity of the feature function f as follows; when using the a'th extraction, we allow the model to re-weight the features from extractions 1 through a. In other words, we condition the value of the feature on the current set of features that have been computed; since there are only F sets in the restricted setting (and not 2^F), this is a feasible option. We simply define f̂^a = [0 . . . f^1 . . . f^a . . . 0], where we add duplicates of features f^1 through f^a for each feature block a. Thus, the model can learn different weights for the same underlying features based on the current level of feature extraction; we found that this was crucial for optimal performance.

Figure 2: Trade-off performance on the pose dataset for wrists (left) and elbows (right). The curve shows the increase in accuracy over the minimal-feature model as a function of total runtime per frame (including all overhead). We compare to two baselines that involve no learning: forward selection and extracting factor-wise features based on the entropy of marginals at each position ("Entropy"). The learned policy results are either greedy ("Greedy" example-level and factor-level) or non-myopic (either our "Q-learning" or the baseline "Imitation"). Note that the example-wise method is far less effective than the factor-wise extraction strategy. Furthermore, Q-learning in particular achieves higher accuracy models at a fraction of the computational cost of using all features, and is more effective than imitation learning.

Note that for reasons of simplicity, we only consider low tree-width models in this work, for which (1) can be efficiently solved via a standard max-sum message-passing algorithm. 
Nonetheless, since φ(s, a) requires access to h(x, z), we must run message-passing every time we compute a new state s in order to compute the next action. Therefore, we run message passing once and then perform less expensive local updates using saved messages from the previous iteration. We define a simple algorithm for such quiescent inference (given in the Supplemental material); we refer to this inference scheme as q-inference. The intuition is that we stop propagating messages once the magnitude of the update to the max-marginal decreases below a certain threshold q; we define q in terms of the margin of the current MAP decoding at the given position, since that margin must be surpassed if the MAP decoding is to change as a result of inference.

5 Experiments

5.1 Tracking of human pose in video

Setup. For this problem, our goal is to predict the joint locations of human limbs in video clips extracted from Hollywood movies. Our testbed is the MODEC+S model proposed in [22]; the MODEC+S model uses the MODEC model of [15] to generate 32 proposed poses per frame of a video sequence, and then combines the predictions using a linear-chain structured sequential prediction model. There are four types of features used by MODEC+S, the final and most expensive of which is a coarse-to-fine optical flow [13]; we incrementally compute poses and features to minimize the total runtime. For more details on the dataset/features, see [22]. We present cross-validation results averaged over 40 80/20 train/test splits of the dataset. We measure localization performance of elbows and wrists in terms of the percentage of times the predicted locations fall within 20 pixels of the ground truth.
Meta-features. We define the meta-features φ(s, a) in terms of the targeted position in the sequence, i, and the current predictions y⋆ = h(x, z). 
Specifically, we concatenate the already computed unary and edge features of y⋆_i and its neighbors (conditioned on the value of z at i), the margin of the current MAP decoding at position i, and a measure of self-consistency computed on y⋆ as follows. For all sets of m frames overlapping with frame i, we extract color histograms for the predicted arm segments and compute the maximum χ²-distance from the first frame to any other frame; we then also add an indicator feature for whether each of these maximum distances exceeds 0.5, and repeat for m = 2, . . . , 5. We also add several bias terms for which sets of features have been extracted around position i.

Figure 3: Controlling overhead on the OCR dataset. While our approach is extremely efficient in terms of how many features are extracted (Left), the additional overhead of inference is prohibitively expensive for the OCR task without applying q-inference (Right) with a large threshold. Furthermore, although the example-wise strategy is less efficient in terms of features extracted, it is more efficient in terms of overhead.
Discussion. We present a short summary of our pose results in Table 1, and compare to various baselines in Figure 2. We found that our Q-learning approach is consistently more effective than all baselines; Q-learning yields a model that is both more accurate and faster than the baseline model trained with all features. 
Furthermore, while the feature extraction decisions of the Q-learning model are significantly correlated with the error of the starting predictions (ρ = 0.23), the entropy-based decisions are not (ρ = 0.02), indicating that our learned reward signal is much more informative.

5.2 Handwriting recognition

Setup. For this problem, we use the OCR dataset from [19], which is pre-divided into 10 folds that we use for cross validation. We use three sets of features: the original pixels (free), and two sets of Histogram-of-Gradients (HoG) features computed on the images with different bin sizes. Unlike the pose setting, the features are very fast to compute compared to inference. Thus, we evaluate the effectiveness of q-inference with various thresholds to minimize inference time. For meta-features, we use the same construction as for pose, but instead of the inter-frame χ²-distance we use a binary indicator of whether or not the specific m-gram occurred in the training set. The results are summarized in Figure 3; see caption for details.

Discussion. Our method is extremely efficient in terms of the features computed for h; however, unlike the pose setting, the overhead of inference is on par with the feature computation. Thus, we obtain a more accurate model with q = 0.5 that is 1.5× faster even though it uses only 1/5 of the features; if the implementation of inference were improved, we would expect a speedup much closer to 5×.

6 Conclusion

We have introduced a framework for learning feature extraction policies and predictive models that adaptively select features for extraction in a factor-wise, on-line fashion. On two tasks our approach yields models that are both more accurate and far more efficient; our work is a significant step towards eliminating the feature extraction bottleneck in structured prediction.
In the future, we intend to extend this approach to loopy model structures where inference is intractable and, more importantly, to allow for features that change the structure of the underlying graph, so that the graph structure can adapt to both the complexity of the input and the test-time computational budget.

Acknowledgements. The authors were partially supported by ONR MURI N000141010934, NSF CAREER 1054215, and by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

[Figure 3 panels: improvement (%) vs. additional feature cost (Left) and additional total cost (Right) on OCR, comparing Single Tier, Greedy (Example), and Q-learning with q = 0, 0.1, 0.5.]

References

[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proc. ICML, 2003.

[2] M. Chen, Z. Xu, K. Q. Weinberger, O. Chapelle, and D. Kedem. Classifier cascade for minimizing feature evaluation cost. In AISTATS, 2012.

[3] M. Collins. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proc. EMNLP, 2002.

[4] T. Gao and D. Koller. Active classification based on value of classifier. In NIPS, 2011.

[5] A. Grubb and D. Bagnell. SpeedBoost: Anytime prediction with uniform near-optimality. In AISTATS, 2012.

[6] H. He, H. Daumé III, and J. Eisner. Imitation learning by coaching. In NIPS, 2012.

[7] H. He, H. Daumé III, and J. Eisner. Dynamic feature selection for dependency parsing. In EMNLP, 2013.

[8] R. A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, 1966.

[9] A. Krause and C. Guestrin.
Optimal value of information in graphical models. Journal of Artificial Intelligence Research (JAIR), 35:557–591, 2009.

[10] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.

[11] M. Lagoudakis and R. Parr. Least-squares policy iteration. JMLR, 2003.

[12] D. V. Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pages 986–1005, 1956.

[13] C. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD thesis, MIT, 2009.

[14] V. C. Raykar, B. Krishnapuram, and S. Yu. Designing efficient cascaded classifiers: tradeoff between accuracy and cost. In SIGKDD, 2010.

[15] B. Sapp and B. Taskar. MODEC: Multimodal decomposable models for human pose estimation. In CVPR, 2013.

[16] B. Sapp, D. Weiss, and B. Taskar. Parsing human motion with stretchable models. In CVPR, 2011.

[17] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.

[18] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In ICML, 2005.

[19] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.

[20] K. Trapeznikov and V. Saligrama. Supervised sequential classification under budget constraints. In AISTATS, 2013.

[21] C. Watkins and P. Dayan. Q-learning. Machine Learning, 1992.

[22] D. Weiss, B. Sapp, and B. Taskar. Dynamic structured model selection. In ICCV, 2013.

[23] D. Weiss and B. Taskar. Structured prediction cascades.
In AISTATS, 2010.