{"title": "Chained Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 1281, "page_last": 1288, "abstract": null, "full_text": "Chained Boosting\n\nChristian R. Shelton\nUniversity of California\n\nRiverside CA 92521\n\ncshelton@cs.ucr.edu\n\nWesley Huie\n\nwhuie@cs.ucr.edu\n\nUniversity of California\n\nRiverside CA 92521\n\nKin Fai Kan\n\nUniversity of California\n\nRiverside CA 92521\nkkan@cs.ucr.edu\n\nAbstract\n\nWe describe a method to learn to make sequential stopping decisions, such as\nthose made along a processing pipeline. We envision a scenario in which a series\nof decisions must be made as to whether to continue to process. Further processing\ncosts time and resources, but may add value. Our goal is to create, based on his-\ntoric data, a series of decision rules (one at each stage in the pipeline) that decide,\nbased on information gathered up to that point, whether to continue processing\nthe part. We demonstrate how our framework encompasses problems from manu-\nfacturing to vision processing. We derive a quadratic (in the number of decisions)\nbound on testing performance and provide empirical results on object detection.\n\n1 Pipelined Decisions\n\nIn many decision problems, all of the data do not arrive at the same time. Often further data collec-\ntion can be expensive and we would like to make a decision without accruing the added cost.\n\nConsider silicon wafer manufacturing. The wafer is processed in a series of stages. After each stage\nsome tests are performed to judge the quality of the wafer. If the wafer fails (due to \ufb02aws), then the\nprocessing time, energy, and materials are wasted. So, we would like to detect such a failure as early\nas possible in the production pipeline.\n\nA similar problem can occur in vision processing. 
Consider the case of object detection in images. Often low-level pixel operations (such as downsampling an image) can be performed in parallel by dedicated hardware (on a video capture board, for example). However, searching each subimage patch of the whole image to test whether it is the object in question takes time that is proportional to the number of pixels. Therefore, we can imagine an image pipeline in which low-resolution versions of the whole image are scanned first. Subimages which are extremely unlikely to contain the desired object are rejected and only those which pass are processed at higher resolution. In this way, we save on many pixel operations and can reduce the cost in time to process an image.

Even if downsampling is not possible through dedicated hardware, for most object detection schemes, the image must be downsampled to form an image pyramid in order to search for the object at different scales. Therefore, we can run the early stages of such a pipelined detector on the low-resolution versions of the image and throw out large regions of the high-resolution versions. Most of the processing is spent searching for small faces (at the high resolutions), so this method can save a lot of processing.

Such chained decisions also occur if there is a human in the decision process (to ask further clarifying questions in database search, for instance). We propose a framework that can model all of these scenarios and allow such decision rules to be learned from historic data. We give a learning algorithm based on the minimization of the exponential loss and conclude with some experimental results.

1.1 Problem Formulation

Let there be s stages to the processing pipeline. We assume that there is a static distribution from which the parts, objects, or units to be processed are drawn.
Let p(x, c) represent this distribution in which x is a vector of the features of this unit and c represents the costs associated with this unit. In particular, let x_i (1 \le i \le s) be the set of measurements (features) available to the decision maker immediately following stage i. Let c_i (1 \le i \le s) be the cost of rejecting (or stopping the processing of) this unit immediately following stage i. Finally, let c_{s+1} be the cost of allowing the part to pass through all processing stages.

Note that c_i need not be monotonic in i. To take our wafer manufacturing example, for wafers that are good we might let c_i = i for 1 \le i \le s, indicating that if a wafer is rejected at any stage, one unit of work has been invested for each stage of processing. For the same good wafers, we might let c_{s+1} = s - 1000, indicating that the value of a completed wafer is 1000 units and therefore the total cost is the processing cost minus the resulting value. For a flawed wafer, the values might be the same, except for c_{s+1}, which we would set to s, indicating that there is no value for a bad wafer.

Note that the costs may be either positive or negative. However, only their relative values are important. Once a part has been drawn from the distribution, there is no way of affecting the "base level" for the value of the part. Therefore, we assume for the remainder of this paper that c_i \ge 0 for 1 \le i \le s + 1 and that c_i = 0 for some value of i (between 1 and s + 1).

Our goal is to produce a series of decision rules f_i(x_i) for 1 \le i \le s. We let f_i have a range of {0, 1} and let 0 indicate that processing should continue and 1 indicate that processing should be halted. We let f denote the collection of these s decision rules and augment the collection with an additional rule f_{s+1} which is identically 1 (for ease of notation).
The cost of using these rules to halt processing an example is therefore

L(f(x), c) = \sum_{i=1}^{s+1} c_i f_i(x_i) \prod_{j=1}^{i-1} (1 - f_j(x_j)) .

We would like to find a set of decision rules that minimize E_p[L(f(x), c)]. While p(x, c) is not known, we do have a series of samples (training set) D = {(x^1, c^1), (x^2, c^2), . . . , (x^n, c^n)} of n examples drawn from the distribution p. We use superscripts to denote the example index and subscripts to denote the stage index.

2 Boosting Solution

For this paper, we consider constructing the rules f_i from simpler decision rules, much as in the Adaboost algorithm [1, 2]. We assume that each decision f_i(x_i) is computed as the threshold of another function g_i(x_i): f_i(x_i) = I(g_i(x_i) > 0).^1 We bound the empirical risk:

\sum_{k=1}^{n} L(f(x^k), c^k) = \sum_{k=1}^{n} \sum_{i=1}^{s+1} c_i^k I(g_i(x_i^k) > 0) \prod_{j=1}^{i-1} I(g_j(x_j^k) \le 0)
\le \sum_{k=1}^{n} \sum_{i=1}^{s+1} c_i^k e^{g_i(x_i^k)} \prod_{j=1}^{i-1} e^{-g_j(x_j^k)} = \sum_{k=1}^{n} \sum_{i=1}^{s+1} c_i^k e^{g_i(x_i^k) - \sum_{j=1}^{i-1} g_j(x_j^k)} .   (1)

Our decision to make all costs positive ensures that the bounds hold. Our decision to make the optimal cost zero helps to ensure that the bound is reasonably tight.

As in boosting, we restrict g_i(x_i) to take the form \sum_{l=1}^{m_i} \alpha_{i,l} h_{i,l}(x_i), the weighted sum of m_i subclassifiers, each of which returns either -1 or +1. We will construct these weighted sums incrementally and greedily, adding one additional subclassifier and associated weight at each step. We will pick the stage, weight, and function of the subclassifier in order to make the largest negative change in the exponential bound to the empirical risk.
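As a concrete check of the halting cost L(f(x), c), here is a minimal sketch; the list-of-rules representation is illustrative, not the paper's code:

```python
def chained_loss(rules, x, c):
    """L(f(x), c) = sum_i c_i f_i(x_i) prod_{j<i} (1 - f_j(x_j)),
    with f_{s+1} identically 1, so c_{s+1} is paid if no rule halts.
    rules: s functions mapping stage features to {0, 1} (1 = halt);
    x: the s per-stage feature vectors; c: the s+1 stage costs.
    """
    passed = 1  # prod_{j<i} (1 - f_j(x_j)): stays 1 until some rule halts
    total = 0.0
    for f_i, x_i, c_i in zip(rules, x, c):
        halt = f_i(x_i)
        total += c_i * halt * passed
        passed *= 1 - halt
    return total + c[len(rules)] * passed  # pass-through cost c_{s+1}
```

Exactly one term of the sum is nonzero per example: the cost of the first stage whose rule fires, or c_{s+1} if none does.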
The subclassifiers h_{i,l} will be drawn from a small class of hypotheses, H.

^1 I is the indicator function that equals 1 if the argument is true and 0 otherwise.

1. Initialize g_i(x) = 0 for all stages i.
2. Initialize w_i^k = c_i^k for all stages i and examples k.
3. For each stage i:
   (a) Calculate targets for each training example, as shown in Equation 5.
   (b) Let h be the result of running the base learner on this set.
   (c) Calculate the corresponding \alpha as per Equation 3.
   (d) Score this classification as per Equation 4.
4. Select the stage \bar{i} with the best (highest) score. Let \bar{h} and \bar{\alpha} be the classifier and weight found at that stage.
5. Let g_{\bar{i}}(x) \leftarrow g_{\bar{i}}(x) + \bar{\alpha} \bar{h}(x).
6. Update the weights (see Equation 2):
   - \forall 1 \le k \le n, multiply w_{\bar{i}}^k by e^{\bar{\alpha} \bar{h}(x_{\bar{i}}^k)}.
   - \forall 1 \le k \le n, j > \bar{i}, multiply w_j^k by e^{-\bar{\alpha} \bar{h}(x_{\bar{i}}^k)}.
7. Repeat from step 3.

Figure 1: Chained Boosting Algorithm

2.1 Weight Optimization

We first assume that the stage at which to add a new subclassifier and the subclassifier to add have already been chosen: \bar{i} and \bar{h}, respectively. That is, \bar{h} will become h_{\bar{i}, m_{\bar{i}}+1}, but we simplify it for ease of expression. Our goal is to find \alpha_{\bar{i}, m_{\bar{i}}+1}, which we similarly abbreviate to \bar{\alpha}. We first define

w_i^k = c_i^k e^{g_i(x_i^k) - \sum_{j=1}^{i-1} g_j(x_j^k)}   (2)

as the weight of example k at stage i, or its current contribution to our risk bound.
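For instance, the weight matrix of Equation 2 might be computed as follows; the function-list and nested-list representations are assumptions for illustration:

```python
import math

def weight_matrix(g, X, C):
    """w_i^k = c_i^k * exp(g_i(x_i^k) - sum_{j<i} g_j(x_j^k))  (Equation 2).
    The pass-through stage s+1 is treated as having g identically 0.
    g: list of s stage scoring functions; X[k]: the s stage feature
    vectors of example k; C[k]: its s+1 stage costs.
    """
    s = len(g)
    w = [[0.0] * (s + 1) for _ in X]
    for k, (x, c) in enumerate(zip(X, C)):
        prefix = 0.0  # running sum_{j<i} g_j(x_j^k)
        for i in range(s):
            g_i = g[i](x[i])
            w[k][i] = c[i] * math.exp(g_i - prefix)
            prefix += g_i
        w[k][s] = c[s] * math.exp(-prefix)  # stage s+1: cost of passing through
    return w
```

With all g_i = 0 this reduces to w_i^k = c_i^k, matching the initialization in Figure 1.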
If we let D^+_{\bar{h}} be the set of indexes of the members of D for which \bar{h} returns +1, and let D^-_{\bar{h}} be similarly defined for those for which \bar{h} returns -1, we can further define

W^+_{\bar{i}} = \sum_{k \in D^+_{\bar{h}}} w_{\bar{i}}^k + \sum_{k \in D^-_{\bar{h}}} \sum_{i=\bar{i}+1}^{s+1} w_i^k ,   W^-_{\bar{i}} = \sum_{k \in D^-_{\bar{h}}} w_{\bar{i}}^k + \sum_{k \in D^+_{\bar{h}}} \sum_{i=\bar{i}+1}^{s+1} w_i^k .

We interpret W^+_{\bar{i}} to be the sum of the weights which \bar{h} will emphasize. That is, it corresponds to the weights along the path that \bar{h} selects: for those examples for which \bar{h} recommends termination, we add the current weight (related to the cost of stopping the processing at this stage). For those examples for which \bar{h} recommends continued processing, we add in all future weights (related to all future costs associated with this example). W^-_{\bar{i}} can be similarly interpreted to be the weights (or costs) that \bar{h} recommends skipping.

If we optimize the loss bound of Equation 1 with respect to \bar{\alpha}, we obtain

\bar{\alpha} = \frac{1}{2} \log \frac{W^-_{\bar{i}}}{W^+_{\bar{i}}} .   (3)

The more weight (cost) that the rule recommends to skip, the higher its \alpha coefficient.

2.2 Full Optimization

Using Equation 3, it is straightforward to show that the reduction in Equation 1 due to the addition of this new subclassifier will be

W^+_{\bar{i}} (1 - e^{\bar{\alpha}}) + W^-_{\bar{i}} (1 - e^{-\bar{\alpha}}) .   (4)

We know of no efficient method for determining \bar{i}, the stage at which to add a subclassifier, except by exhaustive search.
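A sketch of this weight split and the resulting \bar{\alpha} and score (Equations 3 and 4), assuming a precomputed weight matrix w[k][i] and the candidate's +1/-1 votes; the representations are illustrative:

```python
import math

def alpha_and_score(w, h_pred, stage):
    """W+ sums the weights along the paths the candidate selects, W- the
    weights it skips (Section 2.1).  Returns alpha = 0.5 log(W-/W+)
    (Equation 3) and the bound reduction W+(1 - e^a) + W-(1 - e^-a)
    (Equation 4).
    w[k][i]: weight of example k at stage i (0-based; last index is s+1);
    h_pred[k]: +1 to halt example k at `stage`, -1 to continue.
    """
    W_plus = W_minus = 0.0
    for k, pred in enumerate(h_pred):
        future = sum(w[k][stage + 1:])
        if pred == +1:            # halt: the current weight is selected
            W_plus += w[k][stage]
            W_minus += future
        else:                     # continue: all future weights are selected
            W_plus += future
            W_minus += w[k][stage]
    alpha = 0.5 * math.log(W_minus / W_plus)
    score = W_plus * (1 - math.exp(alpha)) + W_minus * (1 - math.exp(-alpha))
    return alpha, score
```

The score is what step 3(d) of Figure 1 would compare across stages before picking \bar{i}.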
However, within a stage, the choice of which subclassifier to use becomes one of maximizing

\sum_{k=1}^{n} z_{\bar{i}}^k \bar{h}(x_{\bar{i}}^k) ,  where  z_{\bar{i}}^k = \left( \sum_{i=\bar{i}+1}^{s+1} w_i^k \right) - w_{\bar{i}}^k   (5)

with respect to \bar{h}. This is equivalent to a weighted empirical risk minimization where the training set is {x_{\bar{i}}^1, x_{\bar{i}}^2, . . . , x_{\bar{i}}^n}. The label of x_{\bar{i}}^k is the sign of z_{\bar{i}}^k, and the weight of the same example is the magnitude of z_{\bar{i}}^k.

2.3 Algorithm

The resulting algorithm is only slightly more complex than standard Adaboost. Instead of a weight vector (one weight for each data example), we now have a weight matrix (one weight for each data example for each stage). We initialize each weight to be the cost associated with halting the corresponding example at the corresponding stage. We start with all g_i(x) = 0. The complete algorithm is as in Figure 1.

Each time through steps 3 through 7, we complete one "round" and add one additional rule to one stage of the processing. We stop executing this loop when \bar{\alpha} \le 0 or when an iteration counter exceeds a preset threshold.

Bottom-Up Variation

In situations where information is only gained after each stage (such as in Section 4), we can also train the classifiers "bottom-up." That is, we can start by only adding classifiers to the last stage. Once finished with it, we proceed to the previous stage, and so on. Thus instead of selecting the best stage, i, in each round, we systematically work our way backward through the stages, never revisiting previously set stages.

3 Performance Bounds

Using the bounds in [3] we can provide a risk bound for this problem.
We let E denote the expectation with respect to the true distribution p(x, c) and \hat{E}_n denote the empirical average with respect to the n training samples. We first bound the indicator function with a piecewise-linear function, b_\theta, with a maximum slope of 1/\theta:

I(z > 0) \le b_\theta(z) = \max\{\min\{1, 1 + z/\theta\}, 0\} .

We then bound the loss: L(f(x), c) \le \phi_\theta(f(x), c), where

\phi_\theta(f(x), c) = \sum_{i=1}^{s+1} c_i \min\{b_\theta(g_i(x_i)), b_\theta(-g_{i-1}(x_{i-1})), b_\theta(-g_{i-2}(x_{i-2})), . . . , b_\theta(-g_1(x_1))\}
= \sum_{i=1}^{s+1} c_i B_\theta^i(g_i(x_i), g_{i-1}(x_{i-1}), . . . , g_1(x_1)) .

We replaced the product of indicator functions with a minimization and then bounded each indicator with b_\theta. B_\theta^i is just a more compact presentation of the composition of the function b_\theta and the minimization. We assume that the weights \alpha at each stage have been scaled to sum to 1. This has no effect on the resulting classifications, but is necessary for the derivation below. Before stating the theorem, for clarity, we state two standard definitions:

Definition 1. Let p(x) be a probability distribution on the set X and let {x_1, x_2, . . . , x_n} be n independent samples from p(x). Let \sigma_1, \sigma_2, . . . , \sigma_n be n independent samples from a Rademacher random variable (a binary variable that takes on either +1 or -1 with equal probability). Let F be a class of functions mapping X to \mathbb{R}. Define the Rademacher Complexity of F to be

R_n(F) = E\left[ \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right| \right]

where the expectation is over the random draws of x_1 through x_n and \sigma_1 through \sigma_n.

Definition 2. Let p(x), {x_1, x_2, . . . , x_n}, and F be as above.
Let g_1, g_2, . . . , g_n be n independent samples from a Gaussian distribution with mean 0 and variance 1. Analogous to the above definition, define the Gaussian Complexity of F to be

G_n(F) = E\left[ \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^{n} g_i f(x_i) \right| \right] .

We can now state our theorem, bounding the true risk by a function of the empirical risk:

Theorem 3. Let H_1, H_2, . . . , H_s be the sequence of the sets of functions from which the base classifier draws for chain boosting. If H_i is closed under negation for all i, all costs are bounded between 0 and 1, and the weights for the classifiers at each stage sum to 1, then with probability 1 - \delta,

E[L(f(x), c)] \le \hat{E}_n[\phi_\theta(f(x), c)] + \frac{k}{\theta} \sum_{i=1}^{s} (i + 1) G_n(H_i) + \sqrt{\frac{8 \ln \frac{2}{\delta}}{n}}

for some constant k.

Proof. Theorem 8 of [3] states

E[L(f(x), c)] \le \hat{E}_n[\phi_\theta(f(x), c)] + 2 R_n(\phi_\theta \circ F) + \sqrt{\frac{8 \ln \frac{2}{\delta}}{n}}

and therefore we need only bound the R_n(\phi_\theta \circ F) term to demonstrate our theorem. For our case, we have

R_n(\phi_\theta \circ F) = E \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \phi_\theta(f(x^i), c^i) \right|
= E \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i \sum_{j=1}^{s+1} c_j^i B_\theta^j(g_j(x_j^i), g_{j-1}(x_{j-1}^i), . . . , g_1(x_1^i)) \right|
\le \sum_{j=1}^{s+1} E \sup_{f \in F} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i B_\theta^j(g_j(x_j^i), g_{j-1}(x_{j-1}^i), . . . , g_1(x_1^i)) \right| = \sum_{j=1}^{s+1} R_n(B_\theta^j \circ G^j)

where G_i is the space of convex combinations of functions from H_i and G^j is the cross product of G_1 through G_j. The inequality comes from switching the expectation and the maximization and then from dropping the c_j^i (see [4], Lemma 5).

Lemma 4 of [3] states that there exists a k such that R_n(B_\theta^j \circ G^j) \le k G_n(B_\theta^j \circ G^j). Theorem 14 of the same paper allows us to conclude that G_n(B_\theta^j \circ G^j) \le \frac{2}{\theta} \sum_{i=1}^{j} G_n(G_i). (Because B_\theta^j is the minimum over a set of functions with maximum slope of 1/\theta, the maximum slope of B_\theta^j is also 1/\theta.) Theorem 12, part 2 states G_n(G_i) = G_n(H_i). Taken together, this proves our result.

Note that this bound has only quadratic dependence on s, the length of the chain, and does not explicitly depend on the number of rounds of boosting (the number of rounds affects \phi_\theta which, in turn, affects the bound).

4 Application

We tested our algorithm on the MIT face database [5]. This database contains 19-by-19 gray-scale images of faces and non-faces. The training set has 2429 face images and 4548 non-face images. The testing set has 472 faces and 23573 non-faces.
We weighted the training set images so that the ratio of the weight of face images to non-face images matched the ratio in the testing set.

[Figure 2 appears here. Panel (a) plots average cost/error per example against the number of rounds (training cost, training error, testing cost, testing error). Panel (b) plots false positive rate and average number of pixels against false negative rate for CB Global, CB Bottom-up, SVM, and Boosting.]

Figure 2: (a) Accuracy versus the number of rounds for a typical run, (b) Error rates and average costs for a variety of cost settings.

4.1 Object Detection as Chained Boosting

Our goal is to produce a classifier that can identify non-face images at very low resolutions, thereby allowing for quick processing of large images (as explained later). Most image patches (or sub-windows) do not contain faces. We, therefore, built a multi-stage detection system where any early rejection is labeled as a non-face. The first stage looks at image patches of size 3-by-3 (i.e., a lower-resolution version of the 19-by-19 original image). The next stage looks at the same image, but at a resolution of 6-by-6. The third stage considers the image at 12-by-12.
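The three stage resolutions can be produced by simple block averaging; this sketch is one plausible preprocessing, not necessarily the paper's exact resizing:

```python
def downsample(img, size):
    """Average-pool a square grayscale image (list of rows of floats)
    down to size-by-size, using integer-boundary blocks."""
    n = len(img)
    out = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            rows = range(r * n // size, (r + 1) * n // size)
            cols = range(c * n // size, (c + 1) * n // size)
            vals = [img[i][j] for i in rows for j in cols]
            out[r][c] = sum(vals) / len(vals)
    return out

def stage_inputs(patch19):
    """Stage inputs for a 19-by-19 patch: 3-by-3, 6-by-6, then 12-by-12."""
    return [downsample(patch19, s) for s in (3, 6, 12)]
```

Each stage then scores only its own low-resolution copy of the patch.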
We did not present the full 19-by-19 images as the classification did not significantly improve over the 12-by-12 versions.

We employ a simple base classifier: the set of all functions that look at a single pixel and predict the class by thresholding the pixel's value. The total classifier at any stage is a linear combination of these simple classifiers. For a given stage, all of the base classifiers that target a particular pixel are added together, producing a complex function of the value of the pixel. Yet, this pixel can only take on a finite number of values (256 in this case). Therefore, we can compile this set of base classifiers into a single look-up function that maps the brightness of the pixel into a real number. The total classifier for the whole stage is merely the sum of these look-up functions. Therefore, the total work necessary to compute the classification at a stage is proportional to the number of pixels in the image considered at that stage, regardless of the number of base classifiers used.

We therefore assign a cost to each stage of processing proportional to the number of pixels at that stage. If the image is a face, we add a negative cost (i.e., bonus) if the image is allowed to pass through all of the processing stages (and is therefore "accepted" as a face). If the image is a non-face, we add a bonus if the image is rejected at any stage before completion (i.e., correctly labelled).

While this dataset has only segmented image patches, in a real application, the classifier would be run on all sub-windows of an image. More importantly, it would also be run at multiple resolutions in order to detect faces of different sizes (or at different distances from the camera). The classifier chain could be run simultaneously at each of these resolutions.
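The compilation of per-pixel threshold stumps into look-up tables, described above, can be sketched as follows; the stump tuple format is an assumption for illustration:

```python
def compile_stage_luts(stumps, n_pixels, n_levels=256):
    """Collapse a stage's per-pixel threshold stumps into one look-up
    table per pixel, so evaluating the stage costs O(#pixels) regardless
    of how many stumps were boosted.
    stumps: list of (pixel_index, threshold, alpha, polarity), each voting
    alpha * polarity if the pixel value exceeds threshold, else the negation.
    """
    luts = [[0.0] * n_levels for _ in range(n_pixels)]
    for pix, thresh, alpha, pol in stumps:
        for v in range(n_levels):
            vote = pol if v > thresh else -pol
            luts[pix][v] += alpha * vote
    return luts

def stage_score(luts, pixels):
    """g_i(x_i): the sum of per-pixel look-ups over the patch."""
    return sum(lut[v] for lut, v in zip(luts, pixels))
```

Adding more rounds of boosting only changes the table entries, not the per-patch evaluation cost.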
To wit, while running the final 12-by-12 stage at one resolution of the image, the 6-by-6 (previous) stage could be run at the same image resolution. This 6-by-6 processing would be the necessary pre-processing step to running the 12-by-12 stage at a higher resolution. As we run our final scan for big faces (at a low resolution), we can already (at the same image resolution) be performing initial tests to throw out portions of the image as not worthy of testing for smaller faces (at a higher resolution). Most of the work of detecting objects must be done at the high resolutions because there are many more overlapping subwindows. This chained method allows the culling of most of this high-resolution image processing.

4.2 Experiments

For each example, we construct a vector of stage costs as above. We add a constant to this vector to ensure that the minimal element is zero, as per Section 1.1. We scale all vectors by the same amount to ensure that the maximal value is 1. This means that the number of misclassifications is an upper bound on the total cost that the learning algorithm is trying to minimize.

There are three flexible quantities in this problem formulation: the cost of a pixel evaluation, the bonus for a correct face classification, and the bonus for a correct non-face classification. Changing these quantities will control the trade-off between false positives and true positives, and between classification error and speed.

Figure 2(a) shows the result of a typical run of the algorithm.
As a function of the number of rounds, it plots the cost (that which the algorithm is trying to minimize) and the error (number of misclassified image patches), for both the training and testing sets (where the training set has been reweighted to have the same proportion of faces to non-faces as the testing set).

We compared our algorithm's performance to the performance of support vector machines (SVM) [6] and Adaboost [1] trained and tested on the highest-resolution, 12-by-12, image patches. We employed SVM-light [7] with a linear kernel. Figure 2(b) compares the error rates for the methods (solid lines, read against the left vertical axis). Note that the error rates are almost identical for the methods. The dashed lines (read against the right vertical axis) show the average number of pixels evaluated (or total processing cost) for each of the methods. The SVM and Adaboost algorithms have a constant processing cost. Our method (by either training scheme) produces lower processing cost for most error rates.

5 Related Work

Cascade detectors for vision processing (see [8] or [9], for example) may appear to be similar to the work in this paper. Especially at first glance, for the area of object detection, they appear almost the same. However, cascade detection and this work (chained detection) are quite different.

Cascade detectors are built one at a time. A coarse detector is first trained. The examples which pass that detector are then passed to a finer detector for training, and so on. A series of targets for false-positive rates define the increasing accuracy of the detector cascade.

By contrast, our chain detectors are trained as an ensemble. This is necessary because of two differences in the problem formulation. First, we assume that the information available at each stage changes.
Second, we assume there is an explicit cost model that dictates the cost of proceeding from stage to stage and the cost of rejection (or acceptance) at any particular stage. By contrast, cascade detectors are seeking to minimize the computational power necessary for a fixed decision. Therefore, the information available to all of the stages is the same, and there are no fixed costs associated with each stage.

The ability to train all of the classifiers at the same time is crucial to good performance in our framework. The first classifier in the chain cannot determine whether it is advantageous to send an example further along unless it knows how the later stages will process the example. Conversely, the later stages cannot construct optimal classifications until they know the distribution of examples that they will see.

Section 4.1 may further confuse the matter. We demonstrated how chained boosting can be used to reduce the computational costs of object detection in images. Cascade detectors are often used for the same purpose. However, the reductions in computational time come from two different sources. In cascade detectors, the time taken to evaluate a given image patch is reduced. In our chained detector formulation, image patches are ignored completely based on analysis of lower-resolution patches in the image pyramid. To further illustrate the difference, cascade detectors can always be used to speed up asymmetric classification tasks (and are often applied to image detection). By contrast, in Section 4.1 we have exploited the fact that object detection in images is typically performed at multiple scales to turn the problem into a pipeline and apply our framework.

Cascade detectors address situations in which prior class probabilities are not equal, while chained detectors address situations in which information is gained at a cost.
Both are valid (and separate) ways of tackling image processing (and other tasks as well). In many ways, they are complementary approaches.

Classic sequence analysis [10, 11] also addresses the problem of optimal stopping. However, it assumes that the samples are drawn i.i.d. from (usually) a known distribution. Our problem is quite different in that each consecutive sample is drawn from a different (and related) distribution and our goal is to find a decision rule without producing a generative model. WaldBoost [12] is a boosting algorithm based on this. It builds a series of features and a ratio comparison test in order to decide when to stop. For WaldBoost, the available features (information) do not change between stages. Rather, any feature is available for selection at any point in the chain. Again, this is a different problem than the one considered in this paper.

6 Conclusions

We feel this framework of staged decision making is useful in a wide variety of areas. This paper demonstrated how the framework applies to one vision processing task. Obviously it also applies to manufacturing pipelines where errors can be introduced at different stages. It should also be applicable to scenarios where information gathering is costly.

Our current formulation only allows for early negative detection. In the face detection example above, this means that in order to report "face," the classifier must process each stage, even if the result is assured earlier. In Figure 2(b), clearly the upper-left corner (100% false positives and 0% false negatives) is reachable with little effort: classify everything positive without looking at any features. We would like to extend this framework to cover such two-sided early decisions.
While perhaps not useful in manufacturing (or even face detection, where the interesting part of the ROC curve is far from the upper-left), it would make the framework more applicable to information-gathering applications.

Acknowledgements

This research was supported through the grant "Adaptive Decision Making for Silicon Wafer Testing" from Intel Research and UC MICRO.

References

[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37, 1995.

[2] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996.

[3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2:463–482, 2002.

[4] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 4:839–860, 2003.

[5] MIT. CBCL face database #1, 2000. http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html.

[6] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992.

[7] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.

[8] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages 511–518, 2001.

[9] Jianxin Wu, Matthew D. Mullin, and James M. Rehg. Linear asymmetric classifier for cascade detectors. In ICML, pages 988–995, 2005.

[10] Abraham Wald. Sequential Analysis. Chapman & Hall, Ltd., 1947.

[11] K. S. Fu. Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, 1968.

[12] Jan Šochman and Jiří Matas.
WaldBoost: learning for time-constrained sequential detection. In CVPR, pages 150–156, 2005.
", "award": [], "sourceid": 2981, "authors": [{"given_name": "Christian", "family_name": "Shelton", "institution": null}, {"given_name": "Wesley", "family_name": "Huie", "institution": null}, {"given_name": "Kin", "family_name": "Kan", "institution": null}]}