{"title": "Learning To Count Objects in Images", "book": "Advances in Neural Information Processing Systems", "page_first": 1324, "page_last": 1332, "abstract": "We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function. We introduce a new loss function, which is well-suited for such learning, and at the same time can be computed efficiently via a maximum subarray algorithm. The learning can then be posed as a convex quadratic program solvable with cutting-plane optimization. The proposed framework is very flexible as it can accept any domain-specific visual features. Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data.", "full_text": "Learning To Count Objects in Images\n\nVictor Lempitsky\n\nVisual Geometry Group\nUniversity of Oxford\n\nAndrew Zisserman\n\nVisual Geometry Group\nUniversity of Oxford\n\nAbstract\n\nWe propose a new supervised learning framework for visual object counting tasks, such\nas estimating the number of cells in a microscopic image or the number of humans in\nsurveillance video frames. 
We focus on the practically-attractive case when the training\nimages are annotated with dots (one dot per object).\nOur goal is to accurately estimate the count. However, we evade the hard task of\nlearning to detect and localize individual object instances. Instead, we cast the problem\nas that of estimating an image density whose integral over any image region gives the\ncount of objects within that region. Learning to infer such density can be formulated as\na minimization of a regularized risk quadratic cost function. We introduce a new loss\nfunction, which is well-suited for such learning, and at the same time can be computed\nef\ufb01ciently via a maximum subarray algorithm. The learning can then be posed as a\nconvex quadratic program solvable with cutting-plane optimization.\nThe proposed framework is very \ufb02exible as it can accept any domain-speci\ufb01c visual\nfeatures. Once trained, our system provides accurate object counts and requires a very\nsmall time overhead over the feature extraction step, making it a good candidate for\napplications involving real-time processing or dealing with huge amount of visual data.\n\nIntroduction\n\n1\nThe counting problem is the estimation of the number of objects in a still image or video frame. It arises\nin many real-world applications including cell counting in microscopic images, monitoring crowds in\nsurveillance systems, and performing wildlife census or counting the number of trees in an aerial image\nof a forest.\nWe take a supervised learning approach to this problem, and so require a set of training images with\nannotation. The question is what level of annotation is required? Arguably, the bare minimum of anno-\ntation is to provide the overall count of objects in each training image. This paper focusses on the next\nlevel of annotation which is to specify the object position by putting a single dot on each object instance\nin each image. 
Figure 1 gives examples of the counting problems and the dotted annotation we consider.\nDotting (pointing) is the natural way to count objects for humans, at least when the number of objects is\nlarge. It may be argued therefore that providing dotted annotations for the training images is no harder\nfor a human than giving just the raw counts. On the other hand, a spatial arrangement of the dots provides\na wealth of additional information, and this paper is, in part, about how to exploit this \u201cfree lunch\u201d (in\nthe context of the counting problem). Overall, it should be noted that dotted annotation is less labour-\nintensive than the bounding-box annotation, let alone pixel-accurate annotation, traditionally used by the\nsupervised methods in the computer vision community [15]. Therefore, the dotted annotation represents\nan interesting and, perhaps, under-investigated case.\nThis paper develops a simple and general discriminative learning-based framework for counting objects\nin images. Similar to global regression methods (see below), it also evades the hard problem of detecting\nall object instances in the images. However, unlike such methods, the approach also takes full and\nextensive use of the spatial information contained in the dotted supervision.\nThe high-level idea of our approach is extremely simple: given an image I, our goal is to recover a\ndensity function F as a real function of pixels in this image. Our notion of density function loosely\n\n1\n\n\fFigure 1: Examples of counting problems. Left \u2014 counting bacterial cells in a \ufb02uorescence-light microscopy\nimage (from [29]), right \u2014 counting people in a surveillance video frame (from [10]). Close-ups are shown along-\nside the images. The bottom close-ups show examples of the dotted annotations (crosses). 
Our framework learns to\nestimate the number of objects in the previously unseen images based on a set of training images of the same kind\naugmented with dotted annotations.\ncorresponds to the physical notion of density as well as to the mathematical notion of measure. Given\nthe estimate F of the density function and the query about the number of objects in the entire image I, the\nnumber of objects in the image is estimated by integrating F over the entire I. Furthermore, integrating\nthe density over an image subregion S \u2282 I gives an estimate of the count of objects in that subregion.\nOur approach assumes that each pixel p in an image is represented by a feature vector xp and models\nthe density function as a linear transformation of xp: F (p) = wT xp. Given a set of training images,\nthe parameter vector w is learnt in the regularized risk framework, so that the density function estimates\nfor the training images matches the ground truth densities inferred from the user annotations (under\nregularization on w).\nThe key conceptual dif\ufb01culty with the density function is the discrete nature of both image observations\n(pixel grid) and, in particular, the user training annotation (sparse set of dots). As a result, while it is\neasy to reason about average densities over the extended image regions (e.g. the whole image), the notion\nof density is not well-de\ufb01ned at a pixel level. Thus, given a set of dotted annotation there is no trivial\nanswer to the question: what should be the ground truth density for this training example. 
Consequently, this local ambiguity also renders standard pixel-based distances between density functions inappropriate for the regularized risk framework.\nOur main contribution, addressing this conceptual difficulty, is a specific distance metric D between density functions used as a loss in our framework, which we call the MESA distance (where MESA stands for Maximum Excess over SubArrays, as well as for the geological term for an elevated plateau). This distance possesses two highly desirable properties:\n1. Robustness. The MESA distance is robust to additive local perturbations of its arguments, such as independent noise or a high-frequency signal, as long as the integrals (counts) of these perturbations over larger regions are close to zero. Thus, it does not matter much how exactly we define the ground truth density locally, as long as the integrals of the ground truth density over larger regions reflect the counts correctly. We can then naturally define the \u201cground truth\u201d density for a dotted annotation to be a sum of normalized Gaussians centered at the dots.\n2. Computability. The MESA distance can be computed exactly via an efficient combinatorial algorithm (maximum subarray [8]). Plugging it into the regularized risk framework then leads to a convex quadratic program for estimating w. While this program has a combinatorial number of linear constraints, the cutting-plane procedure finds a close approximation to the globally optimal w after a small number of iterations.\nThe proposed approach is highly versatile. As virtually no assumptions are made about the features x_p, our framework can benefit from much of the research on good features for object detection. 
Thus, the confidence maps produced by object detectors or the scene explanations resulting from fitting generative models can be turned into features and used by our method.\n\n1.1 Related work.\nA number of approaches tackle counting problems in an unsupervised way, performing grouping based on self-similarities [3] or motion similarities [27]. However, the counting accuracy of such fully unsupervised methods is limited, and therefore others have considered approaches based on supervised learning. Those fall into two categories:\n\n[Figure 2; panel titles, left to right: Input: 6 and 10; Detection: 6 and unclear; Density: 6.52 and 9.37]\nFigure 2: Processing results for a previously unseen image. Left \u2013 a fragment of the microscopy image. Emphasized are the two rectangles containing 6 and 10 cells respectively. Middle \u2013 the confidence map produced by an SVM-based detector; 6 peaks are clearly discernible for the 1st rectangle, but the number of peaks in the 2nd rectangle is unclear. Right \u2013 the density map that our approach produces. The integrals over the rectangles (6.52 and 9.37) are close to the correct number of cells. (MATLAB jet colormap is used)\nCounting by detection: This assumes the use of a visual object detector that localizes individual object instances in the image. Given the localizations of all instances, counting becomes trivial. However, object detection is very far from being solved [15], especially for overlapping instances. In particular, most current object detectors operate in two stages: first producing a real-valued confidence map; second, given such a map, further thresholding and non-maximum suppression steps are needed to locate peaks corresponding to individual instances [12, 26]. 
More generative approaches avoid non-\nmaximum suppression by reasoning about relations between object parts and instances [6, 14, 20, 33,\n34], but they are still geared towards a situation with a small number of objects in images and require\ntime-consuming inference. Alternatively, several methods assume that objects tend to be uniform and\ndisconnected from each other by the distinct background color, so that it is possible to localize individual\ninstances via a Monte-Carlo process [13], morphological analysis [5, 29] or variational optimization [25].\nMethods in these groups deliver accurate counts when their underlying assumptions are met but are not\napplicable in more challenging situations.\nCounting by regression: These methods avoid solving the hard detection problem. Instead, a direct\nmapping from some global image characteristics (mainly histograms of various features) to the number\nof objects is learned. Such a standard regression problem can be addressed by a multitude of machine\nlearning tools (e.g. neural networks [11, 17, 22]). This approach however has to discard any available\ninformation about the location of the objects (dots), using only its 1-dimensional statistics (total number)\nfor learning. As a result, a large number of training images with the supplied counts needs to be provided\nduring training. Finally, counting by segmentation methods [10, 28] can be regarded as hybrids of\ncounting-by-detection and counting-by-regression approaches. They segment the objects into separate\nclusters and then regress from the global properties of each cluster to the overall number of objects in it.\n\n2 The Framework\nWe now provide the detailed description of our framework starting with the description of the learning\nsetting and notation.\n\n2.1 Learning to Count\nWe assume that a set of N training images (pixel grids) I1, I2, . . . IN is given. It is also assumed that\np \u2208 RK. 
Each pixel p in each image Ii is associated with a real-valued feature vector x^i_p \u2208 R^K. We give examples of particular choices of the feature vectors in the experimental section. It is finally assumed that each training image Ii is annotated with a set of 2D points Pi = {P1, . . . , PC(i)}, where C(i) is the total number of objects annotated by the user.\nThe density functions in our approach are real-valued functions over pixel grids, whose integrals over image regions should match the object counts. For a training image Ii, we define the ground truth density function to be a kernel density estimate based on the provided points:\n\n\u2200p \u2208 Ii, F^0_i(p) = \u2211_{P \u2208 Pi} N(p; P, \u03c3^2 1_{2\u00d72}). (1)\n\nHere, p denotes a pixel, and N(p; P, \u03c3^2 1_{2\u00d72}) denotes a normalized 2D Gaussian kernel evaluated at p, with the mean at the user-placed dot P and an isotropic covariance matrix, \u03c3 being a small value (typically, a few pixels). With this definition, the sum of the ground truth density \u2211_{p \u2208 Ii} F^0_i(p) over the entire image will not match the dot count C(i) exactly, as dots that lie very close to the image boundary result in their Gaussian probability mass being partly outside the image. This is a natural and desirable behaviour for most applications, as in many cases an object that lies partly outside the image boundary should not be counted as a full object, but rather as a fraction of an object.\nGiven a set of training images together with their ground truth densities, we aim to learn the linear transformation of the feature representation that approximates the density function at each pixel:\n\n\u2200p \u2208 Ii, Fi(p|w) = w^T x^i_p, (2)\n\nwhere w \u2208 R^K is the parameter vector of the linear transform that we aim to learn from the training data, and Fi(\u00b7|w) is the estimate of the density function for a particular value of w. 
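As a concrete illustration of the ground-truth density (1) and the linear model (2), here is a minimal NumPy sketch; the function names and shapes are ours, not the authors':

```python
import numpy as np

def ground_truth_density(dots, height, width, sigma=4.0):
    """Eq. (1): a sum of normalized 2D Gaussians, one per annotated dot.

    Gaussian mass that falls outside the image is simply lost, so a dot
    near the boundary contributes less than one unit of count, as in the
    paper's discussion of partially visible objects.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width))
    for py, px in dots:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2.0 * sigma ** 2))
        density += g / (2.0 * np.pi * sigma ** 2)  # unit mass on the infinite plane
    return density

def estimate_density(features, w):
    """Eq. (2): per-pixel linear model F(p) = w^T x_p.

    `features` has shape (H, W, K); the returned (H, W) map sums to the
    estimated object count.
    """
    return features @ w
```

Summing `ground_truth_density(...)` over the whole grid returns approximately the number of dots, minus the small Gaussian mass lost at the image boundary.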
The regularized risk framework then suggests choosing w so that it minimizes the sum of the mismatches between the ground truth and the estimated density functions (the loss function) under regularization:\n\nw = argmin_w ( w^T w + \u03bb \u2211_{i=1}^{N} D(F^0_i(\u00b7), Fi(\u00b7|w)) ). (3)\n\nHere, \u03bb is a standard scalar hyperparameter controlling the regularization strength. It is the only hyperparameter in our framework (in addition to those that might be used during feature extraction).\nAfter the optimal weight vector has been learned from the training data, the system can produce a density estimate for an unseen image I by a simple linear weighting of the feature vector computed in each pixel, as suggested by (2). The problem is thus reduced to choosing the right loss function D and computing the optimal w in (3) under that loss.\n\n2.2 The MESA distance\nThe distance D in (3) measures the mismatch between the ground truth and the estimated densities (the loss) and has a significant impact on the performance of the entire learning framework. There are two natural choices for D:\n\u2022 One can choose D to be some function of an L_p metric, e.g. the L1 metric (sum of absolute per-pixel differences) or a square of the L2 metric (sum of squared per-pixel differences). Such choices turn (3) into standard regression problems (i.e. support vector regression and ridge regression for the L1 and squared-L2 cases respectively), where each pixel in each training image effectively provides a sample in the training set. The problem with such a loss is that it is not directly related to the real quantity that we care about, i.e. the overall counts of objects in images. E.g. 
strong zero-mean noise would affect such a metric a lot, while the overall counts would be unaffected.\n\u2022 As the overall counts are what we ultimately care about, one may choose D to be an absolute or squared difference between the overall sums over the entire images for the two arguments, e.g. D(F1(\u00b7), F2(\u00b7)) = |\u2211_{p \u2208 I} F1(p) \u2212 \u2211_{p \u2208 I} F2(p)|. The use of such a pseudometric as a loss turns (3) into the counting-by-regression framework discussed in Section 1.1. Once again, we get either support vector regression (for the absolute differences) or ridge regression (for the squared differences), but now each training sample corresponds to an entire training image. Thus, although this choice of the loss matches our ultimate goal of learning to count very well, it requires many annotated images for training, as the spatial information in the annotation is discarded.\nGiven the significant drawbacks of both baseline distance measures, we suggest an alternative, which we call the MESA distance. Given an image I, the MESA distance DMESA between two functions F1(p) and F2(p) on the pixel grid is defined as the largest absolute difference between the sums of F1(p) and F2(p) over all box subarrays in I:\n\nDMESA(F1, F2) = max_{B \u2208 B} | \u2211_{p \u2208 B} F1(p) \u2212 \u2211_{p \u2208 B} F2(p) |. (4)\n\nHere, B is the set of all box subarrays of I.\nThe MESA distance (in fact, a metric) can be regarded as an L\u221e distance between combinatorially-long vectors of subarray sums. 
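The contrast between the three losses can be made concrete with a small NumPy experiment: a brute-force evaluation of (4) via an integral image (illustrative only; far too slow for real images, and the helper names are ours):

```python
import numpy as np

def mesa_bruteforce(F1, F2):
    """Eq. (4): max over all box subarrays of |sum F1 - sum F2|.

    Box sums are read off an integral image; the O(H^2 W^2) enumeration
    of boxes is acceptable for toy grids only.
    """
    D = F1 - F2
    H, W = D.shape
    S = np.zeros((H + 1, W + 1))
    S[1:, 1:] = D.cumsum(axis=0).cumsum(axis=1)   # S[y, x] = sum of D[:y, :x]
    best = 0.0
    for y0 in range(H):
        for y1 in range(y0 + 1, H + 1):
            for x0 in range(W):
                for x1 in range(x0 + 1, W + 1):
                    box = S[y1, x1] - S[y0, x1] - S[y1, x0] + S[y0, x0]
                    best = max(best, abs(box))
    return best

# One object at (1, 1) ...
F = np.zeros((8, 8)); F[1, 1] = 1.0
# ... the same density plus zero-mean checkerboard "noise" ...
i, j = np.indices(F.shape)
noisy = F + 0.25 * (-1.0) ** (i + j)
# ... and the object moved to the opposite corner.
shifted = np.zeros((8, 8)); shifted[6, 6] = 1.0
```

On these toy grids the per-pixel L1 loss reacts strongly to the count-preserving noise, the count-difference pseudometric is blind to the object's displacement, and MESA behaves as desired on both.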
In the 1D case, it is related to the Kolmogorov-Smirnov distance between probability distributions [23] (in our terminology, the Kolmogorov-Smirnov distance is the maximum of absolute differences over the subarrays with one corner fixed at the top-left; thus a strict subset of B is considered in the Kolmogorov-Smirnov case).\n\n[Figure 3; panel titles, left to right: original, noise added, \u03c3 increased, dots jittered, dots removed, dots reshuffled]\nFigure 3: Comparison of distances for matching density functions. Here, the top-left image shows one of the densities, computed as the ground truth density for a set of dots. The densities in the top row are obtained through some perturbations of the original one. In the bottom row, we compare side-by-side the per-pixel L1 distance, the absolute difference of overall counts, and the MESA distance between the original and the perturbed densities (the distances are normalized across the 5 examples). The MESA distance has the unique property that it tolerates local modifications (noise, jitter, change of the Gaussian kernel), but reacts strongly to a change in the number of objects or their positions. In the middle row we give per-pixel plots of the differences between the respective densities and show the boxes on which the maxima in the definition of the MESA distance are achieved.\n\nThe MESA distance has a number of desirable properties in our framework. Firstly, it is directly related to the counting objective we want to optimize. Since the set of all subarrays includes the full image, DMESA(F1, F2) is an upper bound on the absolute difference of the overall count estimates given by the two densities F1 and F2. Secondly, when the two density functions differ by a zero-mean high-frequency signal or an independent zero-mean noise, the DMESA distance between them is small, because positive and negative deviations of F1 from F2 tend to cancel each other over large regions. 
Thirdly, DMESA is sensitive to the overall spatial layout of the densities. Thus, if the difference between F1 and F2 is a low-frequency signal, e.g. F1 and F2 are the ground truth densities corresponding to two point sets leaning towards two different corners of the image, then the DMESA distance between F1 and F2 is large, even if F1 and F2 sum to the same counts over the entire image. These properties are illustrated in Figure 3.\nThe final property of DMESA is that it can be computed efficiently. This is because it can be rewritten as:\n\nDMESA(F1, F2) = max ( max_{B \u2208 B} \u2211_{p \u2208 B} (F1(p) \u2212 F2(p)), max_{B \u2208 B} \u2211_{p \u2208 B} (F2(p) \u2212 F1(p)) ). (5)\n\nComputing both inner maxima in (5) then constitutes a 2D maximum subarray problem, which is finding the box subarray of a given 2D array with the largest sum. This problem has a number of efficient solutions. Perhaps the simplest of the efficient ones (from [8]) is an exhaustive search over one image dimension (e.g. over the top and bottom boundaries of the optimal subarray) combined with dynamic programming (Kadane's algorithm [7]) to solve the 1D maximum subarray problem along the other dimension in the inner loop. This approach has complexity O(|I|^{1.5}), where |I| is the number of pixels in the image grid. It can be further improved in practice by replacing the exhaustive search over the first dimension with branch-and-bound [4]. More elaborate algorithms that guarantee even better worst-case complexity are known [31]. In our experiments, the algorithm [8] was sufficient, as the time bottleneck lay in the QP solver (see below).\n\n2.3 Optimization\nWe finally discuss how the optimization problem in (3) can be solved in the case when the DMESA distance is employed. 
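Before turning to the optimization, the reduction in (5) can be sketched in NumPy. This simple variant searches the top/bottom boundaries of the box exhaustively and runs Kadane's algorithm over the column sums of each row band, i.e. O(H²W) rather than the O(|I|^1.5) of the variant from [8]; it is a hedged sketch, not the authors' code:

```python
import numpy as np

def max_subarray_2d(D):
    """Largest sum over all non-empty box subarrays of the 2D array D.

    Exhaustive search over the top/bottom rows of the box, with Kadane's
    1D maximum-subarray algorithm run over the column sums of the band.
    """
    H, W = D.shape
    best = -np.inf
    for top in range(H):
        col = np.zeros(W)
        for bottom in range(top, H):
            col += D[bottom]          # column sums of the row band top..bottom
            run = -np.inf             # Kadane's algorithm along the columns
            for v in col:
                run = v if run < 0 else run + v
                best = max(best, run)
    return best

def mesa_distance(F1, F2):
    """Eq. (5): the MESA distance as two 2D maximum-subarray problems."""
    D = F1 - F2
    return max(max_subarray_2d(D), max_subarray_2d(-D))
```

On small grids this agrees with evaluating definition (4) directly, and it upper-bounds the absolute difference of overall counts, as noted above.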
The learning problem (3) can then be rewritten as a convex quadratic program:\n\nmin_{w, \u03be1, . . . , \u03beN} w^T w + \u03bb \u2211_{i=1}^{N} \u03bei, (6)\n\nsubject to \u2200i, \u2200B \u2208 Bi: \u03bei \u2265 \u2211_{p \u2208 B} ( w^T x^i_p \u2212 F^0_i(p) ), \u03bei \u2265 \u2211_{p \u2208 B} ( F^0_i(p) \u2212 w^T x^i_p ). (7)\n\nHere, the \u03bei are auxiliary slack variables (one for each training image) and Bi is the set of all subarrays in image i. At the optimum of (6)\u2013(7), the optimal vector \u02c6w is the solution of (3), while the slack variables equal the MESA distances: \u02c6\u03bei = DMESA(F^0_i(\u00b7), Fi(\u00b7|\u02c6w)).\nThe number of linear constraints in (7) is combinatorial, so that a custom QP-solver cannot be applied directly. A standard iterative cutting-plane procedure, however, overcomes this problem: one starts with only a small subset of constraints activated (we choose 20 boxes with random dimensions in a random subset of images to initialize the process). At each iteration, the QP (6)\u2013(7) is solved with the active subset of constraints. Given the solution w^j, \u03be^j_1, . . . , \u03be^j_N after iteration j, one can find the box subarrays corresponding to the most violated constraints among (7). To do that, for each image we find the subarrays that maximize the right-hand sides of (7), which are exactly the 2D maximum subarrays of F^0_i(\u00b7) \u2212 Fi(\u00b7|w^j) and Fi(\u00b7|w^j) \u2212 F^0_i(\u00b7) respectively. The boxes B^{1,j}_i and B^{2,j}_i corresponding to these maximum subarrays are then found for each image i. If the respective sums \u2211_{p \u2208 B^{1,j}_i} ( F^0_i(p) \u2212 (w^j)^T x^i_p ) and \u2211_{p \u2208 B^{2,j}_i} ( (w^j)^T x^i_p \u2212 F^0_i(p) ) exceed \u03be^j_i \u00b7 (1 + \u03b5), the corresponding constraints are activated, and the next iteration is performed. The iterations terminate when for all images the sums corresponding to the maximum subarrays are within a (1 + \u03b5) factor of \u03be^j_i and hence no constraints are activated. Here, \u03b5 \u226a 1 is a constant that promotes convergence in a small number of iterations to an approximation of the global minimum. Setting \u03b5 to 0 solves the program (6)\u2013(7) exactly, while it has been shown in similar circumstances [16] that setting \u03b5 to a small finite value does not affect the generalization of the learning algorithm and brings guarantees of convergence in a small number of steps.\n\n3 Experiments\nOur framework and several baselines were evaluated on counting tasks for the two types of imagery shown in Figure 1. We now discuss the experiments and the quantitative results. The test datasets and the densities computed with our method can be further assessed qualitatively at the project webpage [1].\nBacterial cells in fluorescence-light microscopy images. Our first experiment is concerned with synthetic images, emulating microscopic views of colonies of bacterial cells, generated with [19] (Figure 1-left). Such synthetic images are highly realistic and simulate such effects as cell overlaps, shape variability, strong out-of-focus blur, vignetting, etc. For the experiments, we generated a dataset of images (available at [1]), with the overall number of cells varying between 74 and 317. Few annotated datasets with real cell microscopy images also exist. 
While it is tempting to use real rather than synthetic imagery, all the real image datasets, to the best of our knowledge, are small (only a few images have annotations), and, most importantly, there are always very big discrepancies between the annotations of different human experts. The latter effectively invalidates the use of such real datasets for a quantitative comparison of different counting approaches.\nBelow we discuss the comparison of the counting accuracy achieved by our approach and the baseline approaches. The features used in all approaches were based on the dense SIFT descriptor [21], computed using the [32] software at each pixel of each image with a fixed SIFT frame radius (about the size of the cell) and fixed orientation. Each algorithm was trained on N training images, while another N images were used for the validation of meta-parameters. The following approaches were considered:\n1. The proposed density-based approach. A very simple feature representation was chosen: a codebook of K entries was constructed via k-means on SIFT descriptors extracted from 20 hold-out images. Then each pixel is represented by a vector of length K, which is 1 at the dimension corresponding to the codebook entry of the SIFT descriptor at that pixel and 0 for all other dimensions. We used the training images to learn the vector w as discussed in Section 2.1. Counting is then performed by summing the values w_t assigned to the codebook entries t over all pixels in the test image. 
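With one-hot codebook features, the counting step reduces to a weighted sum of codebook-entry weights; a minimal sketch (the SIFT extraction and k-means stages are omitted, and all names are ours):

```python
import numpy as np

def density_map(assignments, w):
    """With one-hot codebook features, F(p) = w^T x_p reduces to looking up
    the learned weight of the codebook entry assigned to pixel p.

    `assignments` is an (H, W) integer map of codebook entries (the result
    of vector-quantising a dense descriptor at every pixel); `w` holds one
    learned weight per codebook entry.
    """
    return w[assignments]

def count_in_box(density, y0, y1, x0, x1):
    """Estimated object count in the box [y0, y1) x [x0, x1), cf. Figure 2."""
    return float(density[y0:y1, x0:x1].sum())
```

For example, with `assignments = np.array([[0, 1], [1, 2]])` and `w = np.array([0.5, 1.0, 0.0])`, the whole-image count is 0.5 + 1.0 + 1.0 + 0.0 = 2.5.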
Figure 2-right gives an example of the respective density (see also [1]).\n\nMethod | Validation | N = 1 | N = 2 | N = 4 | N = 8 | N = 16 | N = 32\nlinear ridge regression | counting | 67.3\u00b125.2 | 37.7\u00b114.0 | 16.7\u00b13.1 | 8.8\u00b11.5 | 6.4\u00b10.7 | 5.9\u00b10.5\nkernel ridge regression | counting | 60.4\u00b116.5 | 38.7\u00b117.0 | 18.6\u00b15.0 | 10.4\u00b12.5 | 6.0\u00b10.8 | 5.2\u00b10.3\ndetection | counting | 28.0\u00b120.6 | 20.8\u00b15.8 | 13.6\u00b11.5 | 10.2\u00b11.9 | 10.4\u00b11.2 | 8.5\u00b10.5\ndetection | detection | 20.8\u00b13.8 | 20.1\u00b15.5 | 15.7\u00b12.0 | 15.0\u00b14.1 | 11.8\u00b13.1 | 12.0\u00b10.8\ndetection+correction | counting | \u2013 | 22.6\u00b15.3 | 16.8\u00b16.5 | 6.8\u00b11.2 | 6.1\u00b11.6 | 4.9\u00b10.5\ndensity learning | counting | 12.7\u00b17.3 | 7.8\u00b13.7 | 5.0\u00b10.5 | 4.6\u00b10.6 | 4.2\u00b10.4 | 3.6\u00b10.2\ndensity learning | MESA | 9.5\u00b16.1 | 6.3\u00b11.2 | 4.9\u00b10.6 | 4.9\u00b10.7 | 3.8\u00b10.2 | 3.5\u00b10.2\nTable 1: Mean absolute errors for cell counting on the test set of 100 fluorescent microscopy images. The rows correspond to the methods described in the text. The second column gives the error measure used for learning meta-parameters on the validation set. The last 6 columns correspond to the numbers of images in the training and validation sets. The average number of cells is 171\u00b164 per image. Standard deviations in the table correspond to 5 different draws of the training and validation image sets. 
The proposed method (density learning) considerably outperforms the baseline approaches (including the application-specific baseline, with error rate 16.2) for all sizes of the training set.\n\nMethod | \u2019maximal\u2019 | \u2019downscale\u2019 | \u2019upscale\u2019 | \u2019minimal\u2019 | \u2019dense\u2019 | \u2019sparse\u2019\nCounting-by-Regression [17] | 2.07 | 2.66 | 2.78 | N/A | N/A | N/A\nCounting-by-Regression [28] | 1.80 | 2.34 | 2.52 | 4.46 | N/A | N/A\nCounting-by-Segmentation [28] | 1.53 | 1.64 | 1.84 | 1.31 | N/A | N/A\nDensity learning | 1.70 | 1.28 | 1.59 | 2.02 | 1.78\u00b10.39 | 2.06\u00b10.59\nTable 2: Mean absolute errors for people counting in the surveillance video [10]. The columns correspond to the four scenarios (splits) reproduced from [28] (\u2019maximal\u2019, \u2019downscale\u2019, \u2019upscale\u2019, \u2019minimal\u2019) and to the two new sets of splits (\u2019dense\u2019 and \u2019sparse\u2019). Our method outperforms the counting-by-regression methods and is competitive with the hybrid method of [28], which uses more detailed annotation.\n\n2. The counting-by-regression baseline. Each of the training images was described by a global histogram of occurrences of the entries of the same codebook as above. We then learned two types of regression (ridge regression with linear and Gaussian kernels) to the number of cells in the image.\n3. The counting-by-detection baseline. We trained a detector based on a linear SVM classifier. The SIFT descriptors corresponding to the dotted pixels were considered positive examples. To sample negative examples, we built a Delaunay triangulation on the dots and took the SIFT descriptors corresponding to the pixels at the middle of the Delaunay edges. At detection time, we applied the SVM at each pixel, and then found peaks in the resulting confidence map (e.g. Figure 2-middle) via non-maximum suppression with threshold \u03c4 and radius \u03c1 using the code [18]. 
We also considered a variant with the linear correction of\nthe obtained number to account for systematic biases (detection+correction). The slope and the intercept\nof the correction for each combination of \u03c4, \u03c1, and regularization strength were estimated via robust\nregression on the union of the training and validation sets.\n4. Application-speci\ufb01c method [29]. We also evaluated the software speci\ufb01cally designed for analyzing\ncells in \ufb02uorescence-light images [29]. The counting algorithm here is based on adaptive thresholding\nand morphological analysis. For this baseline, we tuned the free parameter (cell division threshold) on\nthe test set, and computed the mean absolute error, which was 16.2.\nThe meta-parameters (K, regularization strengths, Gaussian kernel width for ridge regression, \u03c4 and \u03c1\nfor non-maximum suppression) were learned in each case on the validation set. The objective minimized\nduring the validation was counting accuracy. For counting-by-detection, we also considered optimizing\ndetection accuracy (computed via Hungarian matching with the ground truth), and, for our approach, we\nalso considered minimizing the MESA distance with the ground truth density on the validation set.\nThe results for a different number N of training and validation images are given in Table 1, based on 5\nrandom draws of training and validation sets. A hold out set of 100 images was used for testing. The\nproposed method outperforms the baseline approaches for all sizes of the training set.\nPedestrians in surveillance video. Here we focus on a 2000-frames video dataset [10] from a camera\noverviewing a busy pedestrian street (Figure 1-right). The authors of [10] also provided the dotted ground\ntruth for these frames, the position of the ground plane, and the region of interest, where the counts\nshould be performed. 
Recently, [28] performed extensive experiments on this dataset and reported the performance of three approaches: two counting-by-regression methods (including [17]) and a hybrid approach (split the foreground into blobs, then regress the count for each blob). The hybrid approach in [28] required more detailed annotations than dotting (see [28] for details). For the sake of comparison, we adhered to the experimental protocols described in [28], so that the performance of our method is directly comparable. In particular, 4 train/test splits were suggested in [28]: 1) 'maximal': train on frames 600:5:1400 (in Matlab notation); 2) 'downscale': train on frames 1205:5:1600 (the most crowded); 3) 'upscale': train on frames 805:5:1100 (the least crowded); 4) 'minimal': train on frames 640:80:1360 (10 frames). Testing is performed on the frames outside the training range. For future reference, we also included two additional scenarios ('dense' and 'sparse') with multiple similar splits in each (permitting variance estimation). Both scenarios are based on splitting the 2000 frames into 5 contiguous chunks of 400 frames. In each of the two scenarios, we then trained on one chunk and tested on the other four. In the 'dense' scenario we trained on 80 frames sampled from the training chunk with uniform spacing, while in the 'sparse' scenario we took just 10 frames.
Extracting features in this case is more involved, as several modalities, namely the image itself, the difference image with the previous frame, and the background-subtracted image, have to be combined to achieve the best performance (a simple median filtering was used to estimate the static background image).
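The static background estimate mentioned above can be sketched as a per-pixel temporal median over a stack of frames; this is one common reading of "simple median filtering", and the function names and the exact window of frames are our illustration:

```python
import numpy as np

def estimate_background(frames):
    """Static background estimate: per-pixel temporal median over a
    list of equally-sized grayscale frames."""
    return np.median(np.stack(frames, axis=0), axis=0)

def background_subtracted(frame, background):
    """Absolute difference between a frame and the background estimate,
    one of the primary modalities combined for feature extraction."""
    return np.abs(frame.astype(np.float64) - background)
```

Transient foreground objects (pedestrians) occupy any given pixel in only a minority of frames, so the temporal median recovers the static scene behind them.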
We used a randomized tree approach similar to [24] to obtain features combining these modalities. Thus, we first extracted the primary features at each pixel, including the absolute differences with the previous frame and with the background, the image intensity, and the absolute values of the x- and y-derivatives. On the training subset of the smallest 'minimal' split, we then trained a random forest [9] with 5 randomized trees. The training objective was the regression from the appearance of each pixel and its neighborhood to the ground truth density. At test time, the random forest performs for each pixel a series of simple tests, each comparing the value of a particular primary channel, at a location defined by a particular offset, against a particular threshold; the channel index, the offset, and the threshold are randomized during forest pretraining. Given the pretrained forest, each pixel p is assigned a vector xp of dimension equal to the total number of leaves in all trees, with ones at the leaves the pixel falls into (one in each of the five trees) and zeros elsewhere. Finally, to account for the perspective distortion, we multiplied xp by the square of the depth of the ground plane at p (provided with the sequence). Within each scenario, we allocated one-fifth of the training frames to pick λ and the tree depth through validation via the MESA distance.
The quantitative comparison in Table 2 demonstrates the competitiveness of our method.
Overall comments. In both sets of experiments, we tried two strategies for setting σ (the kernel width in the definition of the ground truth densities): setting σ = 0 (effectively, the ground truth is then a sum of delta-functions), and setting σ = 4 (roughly comparable with the object half-size in both experiments).
In the first case (cells), both strategies gave almost the same results for all N, highlighting the insensitivity of our approach to the choice of σ (see also Figure 3). The results in Table 1 are for σ = 0. In the second case (pedestrians), σ = 4 had an edge over σ = 0, and the results in Table 2 are for that value.
At train time, we observed that the cutting plane algorithm converged in a few dozen iterations (fewer than 100 for our choice ε = 0.01). The use of a general-purpose quadratic solver [2] meant that the training times were considerable (from several seconds to a few hours, depending on the value of λ and the size of the training set). We anticipate a big reduction in training time from a purpose-built solver. At test time, our approach introduces virtually no time overhead over feature extraction. For example, in the case of pedestrians, one can store the value wt computed during learning at each leaf t in each tree, so that counting simply requires "pushing" each pixel down the forest and summing the resulting wt from the obtained leaves. This can be done in real time [30].
4 Conclusion
We have presented a general framework for learning to count objects in images. While our ultimate goal is counting accuracy over the entire image, during learning our approach optimizes a loss based on the MESA distance. This loss involves counting accuracy over multiple subarrays of the entire image (and not only the entire image itself). We demonstrate that, given a limited amount of training data, such an approach achieves much higher accuracy than optimizing the counting accuracy over the entire image directly (counting-by-regression). At the same time, the fact that we avoid the hard problem of detecting and discerning individual object instances gives our approach an edge over the counting-by-detection method in our experiments.
Acknowledgements.
This work is supported by EU ERC grant VisRec no. 228180. V. Lempitsky is also supported by Microsoft Research projects in Russia. We thank Prof. Jiri Matas (CTU Prague) for suggesting the detection+correction baseline.

References
[1] http://www.robots.ox.ac.uk/%7Evgg/research/counting/index.html.
[2] The MOSEK optimization software. http://www.mosek.com/.
[3] N. Ahuja and S. Todorovic. Extracting texels in 2.1D natural textures. ICCV, pp. 1-8, 2007.
[4] S. An, P. Peursum, W. Liu, and S. Venkatesh. Efficient algorithms for subwindow search in object detection and localization. CVPR, pp. 264-271, 2009.
[5] D. Anoraganingrum. Cell segmentation with median filter and mathematical morphology operation. International Conference on Image Analysis and Processing, p. 1043, 1999.
[6] O. Barinova, V. Lempitsky, and P. Kohli. On the detection of multiple object instances using Hough transforms. CVPR, 2010.
[7] J. L. Bentley. Programming pearls: Algorithm design techniques. Comm. ACM, 27(9):865-871, 1984.
[8] J. L. Bentley. Programming pearls: Perspective on performance. Comm. ACM, 27(11):1087-1092, 1984.
[9] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[10] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. CVPR, 2008.
[11] S.-Y. Cho, T. W. S. Chow, and C.-T. Leung. A neural-based crowd estimation by hybrid global learning algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 29(4):535-541, 1999.
[12] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. ICCV, 2009.
[13] X. Descombes, R. Minlos, and E. Zhizhina. Object extraction using a stochastic birth-and-death dynamics in continuum. Journal of Mathematical Imaging and Vision, 33(3):347-359, 2009.
[14] L. Dong, V. Parameswaran, V. Ramesh, and I. Zoghlami. Fast crowd segmentation using shape indexing. ICCV, pp. 1-8, 2007.
[15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2009/workshop/index.html.
[16] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.
[17] D. Kong, D. Gray, and H. Tao. A viewpoint invariant approach for crowd counting. ICPR (3), pp. 1187-1190, 2006.
[18] P. D. Kovesi. MATLAB and Octave functions for computer vision and image processing. School of Computer Science & Software Engineering, The University of Western Australia. Available from: http://www.csse.uwa.edu.au/~pk/research/matlabfns/.
[19] A. Lehmussola, P. Ruusuvuori, J. Selinummi, H. Huttunen, and O. Yli-Harja. Computational framework for simulating fluorescence microscope images with cell populations. IEEE Trans. Med. Imaging, 26(7):1010-1016, 2007.
[20] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1-3):259-289, 2008.
[21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[22] A. N. Marana, S. A. Velastin, L. F. Costa, and R. A. Lotufo. Estimation of crowd density using image processing. Image Processing for Security Applications, pp. 1-8, 1997.
[23] F. J. Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68-78, 1951.
[24] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. NIPS, pp. 985-992, 2006.
[25] S. K. Nath, K. Palaniappan, and F. Bunyak. Cell segmentation using coupled level sets and graph-vertex coloring. MICCAI (1), pp. 101-108, 2006.
[26] T. W. Nattkemper, H. Wersing, W. Schubert, and H. Ritter. A neural network architecture for automatic segmentation of fluorescence micrographs. Neurocomputing, 48(1-4):357-367, 2002.
[27] V. Rabaud and S. Belongie. Counting crowded moving objects. CVPR (1), pp. 705-711, 2006.
[28] D. Ryan, S. Denman, C. Fookes, and S. Sridharan. Crowd counting using multiple local features. DICTA '09: Proceedings of the 2009 Digital Image Computing: Techniques and Applications, pp. 81-88, 2009.
[29] J. Selinummi, J. Seppala, O. Yli-Harja, and J. A. Puhakka. Software for quantification of labeled bacteria from digital microscope images by automated image analysis. Biotechniques, 39(6):859-863, 2005.
[30] T. Sharp. Implementing decision trees and forests on a GPU. ECCV (4), pp. 595-608, 2008.
[31] H. Tamaki and T. Tokuyama. Algorithms for the maximum subarray problem based on matrix multiplication. SODA, pp. 446-452, 1998.
[32] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
[33] B. Wu, R. Nevatia, and Y. Li. Segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses. CVPR, 2008.
[34] T. Zhao and R. Nevatia. Bayesian human segmentation in crowded situations. CVPR (2), pp. 459-466, 2003.