{"title": "Boxlets: A Fast Convolution Algorithm for Signal Processing and Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 571, "page_last": 577, "abstract": null, "full_text": "Boxlets: a Fast Convolution Algorithm for \n\nSignal Processing and Neural Networks \n\nPatrice Y. Simard\u00b7, Leon Botton, Patrick Haffner and Yann LeCnn \n\npatrice@microsoft.com \n\n{leon b ,haffner ,yann }@research.att.com \n\nAT&T Labs-Research \n\n100 Schultz Drive, Red Bank, NJ 07701-7033 \n\nAbstract \n\nSignal processing and pattern recognition algorithms make exten(cid:173)\nsive use of convolution. In many cases, computational accuracy is \nnot as important as computational speed. In feature extraction, \nfor instance, the features of interest in a signal are usually quite \ndistorted. This form of noise justifies some level of quantization in \norder to achieve faster feature extraction . Our approach consists \nof approximating regions of the signal with low degree polynomi(cid:173)\nals, and then differentiating the resulting signals in order to obtain \nimpulse functions (or derivatives of impulse functions). With this \nrepresentation, convolution becomes extremely simple and can be \nimplemented quite effectively. The true convolution can be recov(cid:173)\nered by integrating the result of the convolution. This method \nyields substantial speed up in feature extraction and is applicable \nto convolutional neural networks. \n\nIntroduction \n\n1 \nIn pattern recognition, convolution is an important tool because of its translation \ninvariance properties. Feature extraction is a typical example: The distance between \na small pattern (i.e. feature) is computed at all positions (i.e. translations) inside a \nlarger one. The resulting \"distance image\" is typically obtained by convolving the \nfeature template with the larger pattern. 
In the remainder of this paper we will use the terms image and pattern interchangeably (because of the topology implied by translation invariance). \nThere are many ways to convolve images efficiently. For instance, a multiplication of images of the same size in the Fourier domain corresponds to a convolution of the two images in the original space. Of course this requires K N log N operations (where N is the number of pixels of the image and K is a constant) just to go in and out of the Fourier domain. These methods are usually not appropriate for feature extraction because the feature to be extracted is small with respect to the image. For instance, if the image and the feature have respectively 32 x 32 and 5 x 5 pixels, the full convolution can be done in 25 x 1024 multiply-adds. In contrast, it would require 2 x K x 1024 x 10 operations just to go in and out of the Fourier domain. \n\n\u2022 Now with Microsoft, One Microsoft Way, Redmond, WA 98052 \n\n\f572 \n\nP. Y. Simard, L. Bottou, P. Haffner and Y. LeCun \n\nFortunately, in most pattern recognition applications, the interesting features are already quite distorted when they appear in real images. Because of this inherent noise, the feature extraction process can usually be approximated (to a certain degree) without affecting the performance. For example, the result of the convolution is often quantized or thresholded to yield the presence and location of distinctive features [1]. Because precision is typically not critical at this stage (features are rarely optimal, thresholding is a crude operation), it is often possible to quantize the signals before the convolution with negligible degradation of performance. \nThe subtlety lies in choosing a quantization scheme which can speed up the convolution while maintaining the same level of performance. We now introduce the convolution algorithm, from which we will deduce the constraints it imposes on quantization. 
\nThe main algorithm introduced in this paper is based on a fundamental property of convolutions. Assuming that f and g have finite support and that f^n denotes the n-th integral of f (or the n-th derivative if n is negative), we can write the following convolution identity: \n\n(f * g)^n = f^n * g = f * g^n (1) \n\nwhere * denotes the convolution operator. Note that f or g are not necessarily differentiable. For instance, the impulse function (also called Dirac delta function), denoted \u03b4, verifies the identity: \n\n\u03b4_a^n * \u03b4_b^m = \u03b4_{a+b}^{n+m} (2) \n\nwhere \u03b4_a^n denotes the n-th integral of the delta function, translated by a (\u03b4_a(x) = \u03b4(x - a)). Equations 1 and 2 are not new to signal processing. Heckbert has developed an effective filtering algorithm [2] where the filter g is a simple combination of polynomials of degree n - 1. Convolution between a signal f and the filter g can then be written as \n\nf * g = f^n * g^{-n} (3) \n\nwhere f^n is the n-th integral of the signal, and the n-th derivative of the filter g can be written exclusively with delta functions (resulting from differentiating degree n - 1 polynomials n times). Since convolving with an impulse function is a trivial operation, the computation of Equation 3 can be carried out effectively. Unfortunately, Heckbert's algorithm is limited to simple polynomial filters and is only interesting when the filter is wide and when the Fourier transform is unavailable (such as in variable-length filters). \nIn contrast, in feature extraction, we are interested in small and arbitrary filters (the features). Under these conditions, the key to fast convolution is to quantize the images to combinations of low-degree polynomials, which are differentiated, convolved and then integrated. The algorithm is summarized by the equation: \n\nf * g ~ F * G = (F^{-n} * G^{-m})^{m+n} (4) \n\nwhere F and G are polynomial approximations of f and g, such that F^{-n} and G^{-m} can be written as sums of impulse functions and their derivatives. 
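To make Equations 1, 2 and 5 concrete, here is a minimal 1-D sketch of the piecewise-constant case of Figure 1. The helper names (differentiate, convolve_impulses, integrate) are ours, not from the paper's implementation, and the signals are chosen to be exactly piecewise constant so that F = f and G = g and the identity is exact:

```python
def differentiate(signal):
    # Backward difference of a finite-support signal; a piecewise-constant
    # signal becomes a short list of (position, weight) impulses.
    impulses, prev = [], 0.0
    for i in range(len(signal) + 1):
        v = signal[i] if i < len(signal) else 0.0  # support ends with a drop to 0
        if v != prev:
            impulses.append((i, v - prev))
        prev = v
    return impulses

def convolve_impulses(fi, gi):
    # Equation 2: delta_a * delta_b = delta_{a+b} -- weights multiply,
    # positions add.  Cost is (#impulses of F) x (#impulses of G).
    out = {}
    for a, wa in fi:
        for b, wb in gi:
            out[a + b] = out.get(a + b, 0.0) + wa * wb
    return out

def integrate(impulses, length, times=2):
    # Two running sums undo the two differentiations (Equation 5).
    s = [0.0] * length
    for pos, w in impulses.items():
        if pos < length:
            s[pos] += w
    for _ in range(times):
        for i in range(1, length):
            s[i] += s[i - 1]
    return s

f = [2.0] * 4 + [5.0] * 4   # already piecewise constant, so F = f
g = [1.0] * 3 + [3.0] * 3   # and G = g
result = integrate(convolve_impulses(differentiate(f), differentiate(g)),
                   len(f) + len(g) + 1)

# Reference dense convolution for comparison (8 x 6 = 48 multiply-adds,
# versus 3 x 3 = 9 impulse products above).
exact = [sum(f[i] * g[k - i] for i in range(len(f)) if 0 <= k - i < len(g))
         for k in range(len(f) + len(g) - 1)]
print(result[:len(exact)] == exact)  # prints True
```

The impulse product count (3 x 3 here) plays the role of the "4 x 3 = 12 multiply-adds" of Figure 1.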
Since the convolution F^{-n} * G^{-m} only involves applying Equation 2, it can be computed quite effectively. The computation of the convolution is illustrated in Figure 1. Let f and g be two arbitrary 1-dimensional signals (top of the figure). Let's assume that f and g can both be approximated by partitions of polynomials, F and G. In the figure, the polynomials are of degree 0 (they are constant), and are depicted in the second line. The details on how to compute F and G will be explained in the next section. In the next step, F and G are differentiated once, yielding successions of impulse functions (third line in the figure). The impulse representation has the advantage of having a finite support, and of being easy to convolve. Indeed two impulse functions can be convolved using Equation 2 (4 x 3 = 12 multiply-adds on the figure). Finally the result of the convolution must be integrated twice to yield \n\nF * G = (F^{-1} * G^{-1})^2 (5) \n\n\fBoxlets: A Fast Convolution Algorithm \n\n573 \n\nFigure 1: Example of convolution between 1-dimensional functions f and g, where the approximations of f and g are piecewise constant. \n\n2 Quantization: from Images to Boxlets \n\nThe goal of this section is to suggest efficient ways to approximate an image f by a cover of polynomials of degree d suited for convolution. Let S be the space on which f is defined, and let C = {c_i} be a partition of S (c_i \u2229 c_j = \u2205 for i \u2260 j, and \u222a_i c_i = S). For each c_i, let p_i be a polynomial of degree d which minimizes the equation: \n\ne_i = \u03a3_{x \u2208 c_i} (f(x) - p_i(x))^2 (6) \n\nThe uniqueness of p_i is guaranteed if c_i is convex. The problem is to find a cover C which minimizes both the number of c_i and \u03a3_i e_i. 
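For degree-0 polynomials, Equation 6 has a closed form: the best constant is the mean of f over the region, and e_i reduces to \u03a3 f^2 - (\u03a3 f)^2 / |c_i|. The sketch below (helper names are ours, not the paper's code) precomputes summed-area tables of f and f^2 so that e_i of any rectangle costs four lookups:

```python
def summed_area(img):
    # S[y][x] = sum of img over rows < y and cols < x.
    h, w = len(img), len(img[0])
    S = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            S[y + 1][x + 1] = img[y][x] + S[y][x + 1] + S[y + 1][x] - S[y][x]
    return S

def rect_sum(S, y0, y1, x0, x1):
    # Sum over rows y0..y1-1 and cols x0..x1-1 -- four lookups, O(1).
    return S[y1][x1] - S[y0][x1] - S[y1][x0] + S[y0][x0]

def constant_fit_error(S1, S2, y0, y1, x0, x1):
    # e_i of Equation 6 when p_i is the best constant (the mean of f).
    area = (y1 - y0) * (x1 - x0)
    s, s2 = rect_sum(S1, y0, y1, x0, x1), rect_sum(S2, y0, y1, x0, x1)
    return s2 - s * s / area

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
S1 = summed_area(img)
S2 = summed_area([[v * v for v in row] for row in img])
e = constant_fit_error(S1, S2, 0, 2, 0, 2)  # top-left 2x2 block: {1, 2, 4, 5}
print(e)  # prints 10.0: mean is 3, squared deviations 4 + 1 + 1 + 4
```

Constant-time access to region sums like these is what makes both partitioning strategies of this section cheap.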
Many different compromises are possible, but since the computational cost of the convolution is proportional to the number of regions, it seemed reasonable to choose the largest regions with a maximum error bounded by a threshold K. Since each region will be differentiated and integrated along the directions of the axes, the boundaries of the c_i's are restricted to be parallel to the axes, hence the appellation boxlet. There are still many ways to compute valid partitions of boxlets and polynomials. We have investigated two very different approaches which both yield a polynomial cover of the image in reasonable time. The first algorithm is greedy. It uses a procedure which, starting from a top left corner, finds the biggest boxlet c_i which satisfies e_i < K without overlapping another boxlet. The algorithm starts with the top left corner of the image, and keeps a list of all possible starting points (uncovered top left corners) sorted by X and Y positions. When the list is exhausted, the algorithm terminates. Surprisingly, this algorithm can run in O(d(N + P log N)), where N is the number of pixels, P is the number of boxlets and d is the degree of the polynomials p_i. Another much simpler algorithm consists of recursively splitting boxlets, starting from a boxlet which encompasses the whole image, until e_i < K for all the leaves of the tree. This algorithm runs in O(dN), is much easier to implement, and is faster (better time constant). Furthermore, even though the first algorithm yields a polynomial coverage with fewer boxlets, the second algorithm yields fewer impulse functions after differentiation because more impulse functions can be combined (see next section). Both algorithms rely on the fact that Equation 6 can be computed 
in constant time. This computation requires the following quantities \n\n\u03a3 f(x, y), \u03a3 f(x, y)^2, \u03a3 f(x, y)x, \u03a3 f(x, y)y, \u03a3 f(x, y)xy, ... (7) \n\nto be pre-computed over the whole image, for the greedy algorithm, or over recursively embedded regions, for the recursive algorithm. In the case of the recursive algorithm these quantities are computed bottom-up and very efficiently. To prevent the sums from becoming too large, a limit can be imposed on the maximum size of c_i. The coefficients of the polynomials are quickly evaluated by solving a small linear system, using the first two sums for polynomials of degree 0 (constants), the first 5 sums for polynomials of degree 1, and so on. \n\nFigure 2: Effects of boxletization: original (top left), greedy (bottom left) with a threshold of 10,000, and recursive (top and bottom right) with a threshold of 10,000. \n\nFigure 2 illustrates the results of the quantization algorithms. The top left corner is a fraction of the original image. The bottom left image illustrates the boxletization of the greedy algorithm, with polynomials of degree 1 and e_i <= 10,000 (13,000 boxlets, 62,000 impulse functions and their derivatives). The top right image illustrates the boxletization of the recursive algorithm, with polynomials of degree 0 and e_i <= 10,000 (47,000 boxlets, 58,000 impulse functions). The bottom right is the same as top right without displaying the boxlet boundaries. In this case the pixel to impulse function ratio is 5.8. \n\n3 Differentiation: from Boxlets to Impulse Functions \n\nIf p_i is a polynomial of degree d, its (d + 1)-th derivative can be written as a sum of derivatives of impulse functions, which are zero everywhere but at the corners of c_i. These impulse functions summarize the boundary conditions and completely characterize p_i. 
They can be represented by four (d + 1)-dimensional vectors associated with the 4 corners of c_i. Figure 3 (top) illustrates the impulse functions at the 4 corners when the polynomial is a constant (degree zero). \n\nFigure 3: Differentiation of a constant polynomial in 2D (top). Combining the derivatives of adjacent polynomials (bottom). \n\nNote that the polynomial must be differentiated d + 1 times (in this example the polynomial is a constant, so d = 0) with respect to each dimension of the input space. This is illustrated at the top of Figure 3. The cover C being a partition, boundary conditions between adjacent squares do simplify; that is, the same derivatives of impulse functions at the same location can be combined by adding their coefficients. It is very advantageous to do so because it will reduce the computation of the convolution in the next step. This is illustrated in Figure 3 (bottom). This combining of impulse functions is one of the reasons why the recursive algorithm for the quantization is preferred to the greedy algorithm. In the recursive algorithm, the boundaries of boxlets are often aligned, so that the impulse functions of adjacent boxlets can be combined. Typically, after simplification, there are only 20% more impulse functions than there are boxlets. In contrast, the greedy algorithm generates up to 60% more impulse functions than boxlets, due to the fact that there are no alignment constraints. For the same threshold the recursive algorithm generates 20% to 30% fewer impulse functions than the greedy algorithm. 
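The combining of Figure 3 (bottom) amounts to merging x-sorted impulse lists at the same vertical position, adding the coefficients of coincident impulses, which often cancel outright on shared boxlet edges. A minimal sketch with our own names (not the paper's code):

```python
def merge_combine(top, bottom):
    # Merge two x-sorted lists of (x, weight) impulses lying on the same
    # horizontal line; coincident impulses combine by adding their weights,
    # and a zero combined weight disappears entirely.
    out, i, j = [], 0, 0
    while i < len(top) or j < len(bottom):
        if j >= len(bottom) or (i < len(top) and top[i][0] < bottom[j][0]):
            out.append(top[i]); i += 1
        elif i >= len(top) or bottom[j][0] < top[i][0]:
            out.append(bottom[j]); j += 1
        else:  # same x: combine
            w = top[i][1] + bottom[j][1]
            if w != 0:
                out.append((top[i][0], w))
            i += 1; j += 1
    return out

# Two boxlets sharing an edge at x = 4: the -1 closing the first run
# cancels the +1 opening the second, leaving one larger run.
print(merge_combine([(0, 1), (4, -1)], [(4, 1), (8, -1)]))  # [(0, 1), (8, -1)]
```

Each call is linear in the number of impulses, which is why the overall simplification stays within the O(dN) budget discussed next.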
\nFinding which impulse functions can be combined is a difficult task because the recursive representation returned by the recursive algorithm does not provide any means for matching the bottoms of squares on one line with the tops of squares from below that line. Sorting takes O(P log P) computational steps (where P is the number of impulse functions) and is therefore too expensive. A better algorithm is to visit the recursive tree and accumulate all the top corners into sorted (horizontal) lists. A similar procedure sorts all the bottom corners (also into horizontal lists). The horizontal lists corresponding to the same vertical positions can then be merged in O(P) operations. The complete algorithm which quantizes an image of N pixels and returns sorted lists of impulse functions runs in O(dN) (where d is the degree of the polynomials). \n\n4 Results \n\nThe convolution speed of the algorithm was tested with feature extraction on the image shown on the top left of Figure 2. The image is quantized, but the feature is not. The feature is tabulated in kernels of sizes 5 x 5, 10 x 10, 15 x 15 and 20 x 20. If the kernel is decomposable, the algorithm can be modified to do two 1D convolutions instead of the present 2D convolution. \nThe quantization of the image is done with constant polynomials, and with thresholds varying from 1,000 to 40,000. This corresponds to varying the pixel to impulse function ratio from 2.3 to 13.7. Since the feature is not quantized, these ratios correspond exactly to the ratios of the number of multiply-adds for the standard convolution versus the boxlet convolution (excluding quantization and integration). The 
\nTable 1: Convolution speed-up factors \n\nFigure 4: Run-length X convolution \n\nactual speed-up factors are summarized in Table 1. The last four columns indicate the measured time ratios between the standard convolution and the boxlet convolution. For each threshold value, the top line indicates the time ratio of standard convolution versus quantization, convolution and integration time for the boxlet convolution. The bottom line does not take into account the quantization time. The feature size was varied from 5 x 5 to 20 x 20. Thus with a threshold of 10,000 and a 5 x 5 kernel, the quantization ratio is 5.8, and the speed-up factor is 2.8. The loss in image quality can be seen by comparing the top left and the bottom right images. If several features are extracted, the quantization time of the image is shared amongst the features and the speed-up factor is closer to 4.7. \nIt should be noted that these speed-up factors depend on the quantization level, which depends on the data and affects the accuracy of the result. The good news is that for each application the optimal threshold (the maximum level of quantization which has negligible effect on the result) can be evaluated quickly. Once the optimal threshold has been determined, one can enjoy the speed-up factor. It is remarkable that with a quantization factor as low as 2.3, the speed-up ratio can range from 1.5 to 2.3, depending on the number of features. We believe that this method is directly applicable to forward propagation in convolutional neural nets (although no results are available at this time). 
\nThe next application shows a case where quantization has no adverse effect on the accuracy of the convolution, and yet large speed-ups are obtained. \n\n5 Binary images and run-length encoding \n\nThe quantization steps described in Sections 2 and 3 become particularly simple when the image is binary. If the threshold is set to zero, and if only the X derivative is considered, the impulse representation is equivalent to run-length encoding. Indeed the position of each positive impulse function codes the beginning of a run, while the position of each negative impulse codes the end of a run. The horizontal convolution can be computed effectively using the boxlet convolution algorithm. This is illustrated in Figure 4. In (a), the distance between two binary images must be evaluated for every horizontal position (horizontal translation-invariant distance). The result is obtained by convolving each horizontal line and by computing the sum of each of the convolution functions. The convolution of two runs is depicted in (b), while the summation of all the convolutions of two runs is depicted in (c). If an impulse representation is used for the runs (a first derivative), each summation of a convolution between two runs requires only 4 additions of impulse functions, as depicted in (d). The result must be integrated twice, according to Equation 5. The speed-up factors can be considerable depending on the width of the images (an order of magnitude if the width is 40 pixels), and there is no accuracy penalty. \n\nFigure 5: Binary image (left) and compact impulse function encoding (right). \n\nThis speed-up also generalizes to 2-dimensional encoding of binary images. The gain comes from the frequent cancellations of impulse functions of adjacent boxlets. The number of impulse functions is proportional to the contour length of the binary shapes. 
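A minimal sketch of the binary case (our names, standard library only): the X derivative of a binary row is exactly its run-length code, +1 at each run start and -1 at each run end, and two rows are convolved through Equation 2 followed by the double integration of Equation 5:

```python
def runs_to_impulses(row):
    # Backward difference of a binary row: the impulse list doubles as its
    # run-length code (+1 opens a run, -1 closes it).
    imp, prev = [], 0
    for i in range(len(row) + 1):
        v = row[i] if i < len(row) else 0
        if v != prev:
            imp.append((i, v - prev))
        prev = v
    return imp

def convolve_rows(a, b):
    # One multiply-add per pair of run boundaries (Equation 2), then two
    # running sums to recover the convolution (Equation 5).
    ia, ib = runs_to_impulses(a), runs_to_impulses(b)
    n = len(a) + len(b) + 1
    d2 = [0] * n
    for p, wp in ia:
        for q, wq in ib:
            d2[p + q] += wp * wq
    for _ in range(2):
        for i in range(1, n):
            d2[i] += d2[i - 1]
    return d2[:len(a) + len(b) - 1]

a = [1, 1, 1, 0, 0, 1, 1, 0]  # two runs -> 4 impulses
b = [0, 1, 1, 1, 0, 0, 0, 0]  # one run  -> 2 impulses
exact = [sum(a[i] * b[k - i] for i in range(len(a)) if 0 <= k - i < len(b))
         for k in range(len(a) + len(b) - 1)]
print(convolve_rows(a, b) == exact)  # prints True
```

Here the impulse convolution costs 4 x 2 = 8 operations against 8 x 8 = 64 for the dense row convolution, and the result is exact since binary rows are already piecewise constant.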
In this case, the boxlet computation is mostly an efficient algorithm for 2-dimensional run-length encoding. This is illustrated in Figure 5. As with run-length encoding, a considerable speed-up is obtained for convolution, at no accuracy penalty. \n\n6 Conclusion \n\nWhen convolutions are used for feature extraction, precision can often be sacrificed for speed with negligible degradation of performance. The boxlet convolution method combines quantization and convolution to offer a continuously adjustable trade-off between accuracy and speed. In some cases (such as in relatively simple binary images) large speed-ups can come with no adverse effects. The algorithm is directly applicable to the forward propagation in convolutional neural networks and in pattern matching when translation invariance results from the use of convolution. \n\nReferences \n\n[1] Yann LeCun and Yoshua Bengio, \"Convolutional networks for images, speech, and time-series,\" in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed., MIT Press, 1995. \n\n[2] Paul S. Heckbert, \"Filtering by repeated integration,\" in ACM SIGGRAPH Conference on Computer Graphics, Dallas, TX, August 1986, vol. 20, pp. 315-321.\n", "award": [], "sourceid": 1602, "authors": [{"given_name": "Patrice", "family_name": "Simard", "institution": null}, {"given_name": "L\u00e9on", "family_name": "Bottou", "institution": null}, {"given_name": "Patrick", "family_name": "Haffner", "institution": null}, {"given_name": "Yann", "family_name": "LeCun", "institution": null}]}