{"title": "Deep Neural Networks for Object Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 2553, "page_last": 2561, "abstract": "Deep Neural Networks (DNNs) have recently shown outstanding performance on the task of whole image classification. In this paper we go one step further and address the problem of object detection -- not only classifying but also precisely localizing objects of various classes using DNNs. We present a simple and yet powerful formulation of object detection as a regression to object masks. We define a multi-scale inference procedure which is able to produce a high-resolution object detection at a low cost by a few network applications. The approach achieves state-of-the-art performance on Pascal 2007 VOC.", "full_text": "Deep Neural Networks for Object Detection\n\nChristian Szegedy Alexander Toshev Dumitru Erhan\n\n{szegedy, toshev, dumitru}@google.com\n\nGoogle, Inc.\n\nAbstract\n\nDeep Neural Networks (DNNs) have recently shown outstanding performance on\nimage classi\ufb01cation tasks [14]. In this paper we go one step further and address\nthe problem of object detection using DNNs, that is not only classifying but also\nprecisely localizing objects of various classes. We present a simple and yet pow-\nerful formulation of object detection as a regression problem to object bounding\nbox masks. We de\ufb01ne a multi-scale inference procedure which is able to pro-\nduce high-resolution object detections at a low cost by a few network applications.\nState-of-the-art performance of the approach is shown on Pascal VOC.\n\nIntroduction\n\n1\nAs we move towards more complete image understanding, having more precise and detailed object\nrecognition becomes crucial. 
In this context, one cares not only about classifying images, but also\nabout precisely estimating the class and location of objects contained within the images,\na problem known as object detection.\nThe main advances in object detection were achieved thanks to improvements in object representations\nand machine learning models. A prominent example of a state-of-the-art detection system is\nthe Deformable Part-based Model (DPM) [9]. It builds on carefully designed representations and\nkinematically inspired part decompositions of objects, expressed as a graphical model. Using discriminative\nlearning of graphical models allows for building high-precision part-based models for a\nvariety of object classes.\nManually engineered representations in conjunction with shallow discriminatively trained models\nhave been among the best performing paradigms for the related problem of object classification\nas well [17]. In recent years, however, Deep Neural Networks (DNNs) [12] have emerged as a\npowerful machine learning model.\nDNNs exhibit major differences from traditional approaches for classification. First, they are deep\narchitectures which have the capacity to learn more complex models than shallow ones [2]. This\nexpressivity and robust training algorithms allow for learning powerful object representations without\nthe need to hand-design features. This has been empirically demonstrated on the challenging\nImageNet classification task [5] across thousands of classes [14, 15].\nIn this paper, we exploit the power of DNNs for the problem of object detection, where we not only\nclassify but also try to precisely localize objects. 
The problem we address here is challenging,\nsince we want to detect a potentially large number of object instances with varying sizes in the same\nimage using a limited amount of computing resources.\nWe present a formulation which is capable of predicting the bounding boxes of multiple objects in\na given image. More precisely, we formulate a DNN-based regression which outputs a binary mask\nof the object bounding box (and portions of the box as well), as shown in Fig. 1. Additionally,\nwe employ a simple bounding box inference to extract detections from the masks. To increase\nlocalization precision, we apply the DNN mask generation in a multi-scale fashion on the full image\nas well as on a small number of large image crops, followed by a refinement step (see Fig. 2).\nIn this way, with only a few dozen DNN regressions we can achieve state-of-the-art bounding box\nlocalization.\nIn this paper, we demonstrate that DNN-based regression is capable of learning features which\nare not only good for classification, but also capture strong geometric information. We use the\ngeneral architecture introduced for classification by [14] and replace the last layer with a regression\nlayer. The somewhat surprising but powerful insight is that networks which, to some extent, encode\ntranslation invariance can capture object locations as well.\nSecond, we introduce a multi-scale box inference followed by a refinement step to produce precise\ndetections. In this way, we are able to apply a DNN which predicts a low-resolution mask, limited\nby the output layer size, to pixel-wise precision at a low cost \u2013 the network is applied only a few\ndozen times per input image.\nIn addition, the presented method is quite simple. There is no need to hand-design a model which\ncaptures parts and their relations explicitly. 
This simplicity has the advantage of easy applicability\nto a wide range of classes, and it also shows better detection performance across a wider range of objects\n\u2013 rigid ones as well as deformable ones. This is presented together with state-of-the-art detection\nresults on the Pascal VOC challenge [7] in Sec. 7.\n\n2 Related Work\n\nOne of the most heavily studied paradigms for object detection is the deformable part-based model,\nwith [9] being the most prominent example. This method combines a set of discriminatively trained\nparts in a star model called a pictorial structure. It can be considered as a 2-layer model \u2013 parts being\nthe first layer and the star model being the second layer. Contrary to DNNs, whose layers are generic,\nthe work by [9] exploits domain knowledge \u2013 the parts are based on manually designed Histograms\nof Oriented Gradients (HOG) descriptors [4] and the structure of the parts is kinematically motivated.\nDeep architectures for object detection and parsing have been motivated by part-based models and\ntraditionally are called compositional models, where the object is expressed as a layered composition\nof image primitives. A notable example is the And/Or graph [20], where an object is modeled\nby a tree with And-nodes representing different parts and Or-nodes representing different modes\nof the same part. Similarly to DNNs, the And/Or graph consists of multiple layers, where lower\nlayers represent small generic image primitives, while higher layers represent object parts. Such\ncompositional models are easier to interpret than DNNs. On the other hand, they require inference,\nwhile the DNN models considered in this paper are purely feed-forward with no latent variables to\nbe inferred.\nFurther examples of compositional models for detection are based on segments as primitives [1],\nfocus on shape [13], use Gabor filters [10] or larger HOG filters [19]. 
These approaches are traditionally\nchallenged by the difficulty of training and rely on specially designed learning procedures.\nMoreover, at inference time they combine bottom-up and top-down processes.\nNeural networks (NNs) can be considered as compositional models where the nodes are more\ngeneric and less interpretable than the above models. Applications of NNs to vision problems are\ndecades old, with Convolutional NNs being the most prominent example [16]. It was not until\nrecently that these models emerged as highly successful on large-scale image classification tasks\n[14, 15] in the form of DNNs. Their application to detection, however, is limited. Scene parsing,\nas a more detailed form of detection, has been attempted using multi-layer Convolutional NNs [8].\nSegmentation of medical imagery has been addressed using DNNs [3]. Both approaches, however,\nuse the NNs as local or semi-local classifiers, either over superpixels or at each pixel location. Our\napproach, in contrast, uses the full image as an input and performs localization through regression. As\nsuch, it is a more efficient application of NNs.\nPerhaps the closest approach to ours is [18], which has a similar high-level objective but uses a much\nsmaller network with different features, a different loss function, and no machinery to distinguish between\nmultiple instances of the same class.\n\nFigure 1: A schematic view of object detection as DNN-based regression.\n\nFigure 2: After regressing to object masks across several scales and large image boxes, we perform\nobject box extraction. The obtained boxes are refined by repeating the same procedure on the sub-images,\ncropped via the current object boxes. For brevity, we display only the full object mask;\nhowever, we use all five object masks.\n\n3 DNN-based Detection\n\nThe core of our approach is a DNN-based regression towards an object mask, as shown in Fig. 
1.\nBased on this regression model, we can generate masks for the full object as well as portions of\nthe object. A single DNN regression can give us masks of multiple objects in an image. To further\nincrease the precision of the localization, we apply the DNN localizer on a small set of large sub-windows.\nThe full flow is presented in Fig. 2 and explained below.\n\n4 Detection as DNN Regression\n\nOur network is based on the convolutional DNN defined by [14]. It consists of 7 layers in total, the\nfirst 5 of which are convolutional and the last 2 fully connected. Each layer uses a rectified linear\nunit as a non-linear transformation. Three of the convolutional layers additionally use max pooling.\nFor further details, we refer the reader to [14].\nWe adapt the above generic architecture for localization. Instead of using a softmax classifier as the\nlast layer, we use a regression layer which generates an object binary mask DNN(x; \u0398) \u2208 R^N,\nwhere \u0398 are the parameters of the network and N is the total number of pixels. Since the output\nof the network has a fixed dimension, we predict a mask of a fixed size N = d \u00d7 d. 
After being\nresized to the image size, the resulting binary mask represents one or several objects: it should have\nvalue 1 at a particular pixel if this pixel lies within the bounding box of an object of a given class and\n0 otherwise.\nThe network is trained by minimizing the L2 error for predicting a ground truth mask m \u2208 [0, 1]^N\nfor an image x:\n\nmin_\u0398 \u2211_{(x,m) \u2208 D} ||(Diag(m) + \u03bbI)^{1/2} (DNN(x; \u0398) \u2212 m)||_2^2,\n\nwhere the sum ranges over a training set D of images containing bounding-boxed objects which are\nrepresented as binary masks.\nSince our base network is highly non-convex and optimality cannot be guaranteed, it is sometimes\nnecessary to regularize the loss function by using varying weights for each output depending on the\nground truth mask. The intuition is that most of the objects are small relative to the image size and\nthe network can be easily trapped by the trivial solution of assigning a zero value to every output.\nTo avoid this undesirable behavior, it is helpful to increase the weight of the outputs corresponding\nto non-zero values in the ground truth mask by a parameter \u03bb \u2208 R+. If \u03bb is chosen small, then the\nerrors on outputs with ground truth value 0 are penalized significantly less than those with value 1,\nthereby encouraging the network to predict nonzero values even if the signals are weak.\nIn our implementation, we used networks with a receptive field of 225 \u00d7 225 and outputs predicting\na mask of size d \u00d7 d for d = 24.\n\n5 Precise Object Localization via DNN-generated Masks\n\nAlthough the presented approach is capable of generating high-quality masks, there are several\nadditional challenges. 
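Before turning to these challenges, note that the weighted training objective of Sec. 4 reduces to a per-pixel weighted squared error, since Diag(m) + \u03bbI is diagonal: the error at pixel i is scaled by (m_i + \u03bb). A minimal sketch (illustrative Python, not the paper's implementation; the value \u03bb = 0.1 is an arbitrary choice, as the paper does not report one):

```python
def masked_l2_loss(pred, target, lam=0.1):
    """Per-image loss ||(Diag(m) + lam*I)^(1/2) (DNN(x) - m)||_2^2.

    Because Diag(m) + lam*I is diagonal, this is a squared error in which
    pixel i is weighted by (m_i + lam): object pixels (m_i = 1) count far
    more than background pixels (m_i = 0) when lam is small.
    """
    assert len(pred) == len(target)
    return sum((t + lam) * (p - t) ** 2 for p, t in zip(pred, target))

# The trivial all-zero prediction is penalized almost entirely on the
# object pixels, which is what discourages collapsing to a zero output.
target = [0.0, 0.0, 1.0, 1.0]             # flattened ground-truth mask
print(masked_l2_loss([0.0] * 4, target))  # 2 * (1 + 0.1) * 1 = 2.2
```

With a small \u03bb, a weak but correct response on object pixels is rewarded more than it is punished for spilling onto the background, matching the intuition given above.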
First, a single object mask might not be sufficient to disambiguate objects\nwhich are placed next to each other. Second, due to the limits in the output size, we generate masks\nthat are much smaller than the size of the original image. For example, for an image of size 400 \u00d7 400\nand d = 24, each output would correspond to a cell of size 16 \u00d7 16, which would be insufficient to\nprecisely localize an object, especially if it is a small one. Finally, since we use the full image as an input,\nsmall objects will affect very few input neurons and thus will be hard to recognize. In the\nfollowing, we explain how we address these issues.\n\n5.1 Multiple Masks for Robust Localization\n\nTo deal with multiple touching objects, we generate not one but several masks, each representing\neither the full object or part of it. Since our end goal is to produce a bounding box, we use one\nnetwork to predict the object box mask and four additional networks to predict four halves of the\nbox: bottom, top, left and right halves, all denoted by mh, h \u2208 {full, bottom, top, left, right}. These\nfive predictions are over-complete but help reduce uncertainty and deal with mistakes in some of the\nmasks. Further, if two objects of the same type are placed next to each other, then at least two of the\nproduced five masks would not have the objects merged, which would allow us to disambiguate them.\nThis would enable the detection of multiple objects.\nAt training time, we need to convert the object box to these five masks. Since the masks can be\nmuch smaller than the original image, we need to downsize the ground truth mask to the size of the\nnetwork output. Denote by T(i, j) the rectangle in the image for which the presence of an object is\npredicted by output (i, j) of the network. This rectangle has upper left corner at ((d1/d)(i \u2212 1), (d2/d)(j \u2212 1))\nand has size d1/d \u00d7 d2/d, where d is the size of the output mask and d1, d2 the height and width of the\nimage. During training, we assign as the value m(i, j) to be predicted the portion of T(i, j) covered\nby the box bb(h):\n\nmh(i, j; bb) = area(bb(h) \u2229 T(i, j)) / area(T(i, j))    (1)\n\nwhere bb(full) corresponds to the ground truth object box. For the remaining values of h, bb(h)\ncorresponds to one of the four halves of the original box.\nNote that we use the full object box as well as the top, bottom, left and right halves of the box to\ndefine a total of five different coverage types. The resulting mh(bb) for ground truth box bb is\nused at training time for the network of type h.\nAt this point, it should be noted that one could train one network for all masks, where the output\nlayer would generate all five of them. This would enable scalability. In this way, the five localizers\nwould share most of the layers and thus would share features, which seems natural since they are\ndealing with the same object. An even more aggressive approach \u2013 using the same localizer for a\nlot of distinct classes \u2013 also seems workable.\n\n5.2 Object Localization from DNN Output\n\nIn order to complete the detection process, we need to estimate a set of bounding boxes for each\nimage. Although the output resolution is smaller than the input image, we rescale the binary masks\nto the resolution of the input image. The goal is to estimate bounding boxes bb = (i, j, k, l)\nparametrized by their upper-left corner (i, j) and lower-right corner (k, l) in output mask coordinates.\nTo do this, we use a score S expressing an agreement of each bounding box bb with the masks and\ninfer the boxes with the highest scores. A natural agreement would be to measure what portion of the\nbounding box is covered by the mask:\n\nS(bb, m) = (1 / area(bb)) \u2211_{(i,j)} m(i, j) area(bb \u2229 T(i, j))    (2)\n\nwhere we sum over all network outputs indexed by (i, j) and denote by m = DNN(x) the output\nof the network. 
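When candidate boxes are aligned with the output cells, the score of Eq. (2) is simply the mean mask value inside the box, and it can be computed in constant time per box from an integral image (summed-area table). A small sketch under that cell-alignment assumption (illustrative Python, not the original implementation):

```python
def integral_image(mask):
    """Summed-area table: ii[i][j] holds the sum of all mask values in
    rows < i and columns < j (one extra row/column of zeros for padding)."""
    h, w = len(mask), len(mask[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for i in range(h):
        for j in range(w):
            ii[i + 1][j + 1] = mask[i][j] + ii[i][j + 1] + ii[i + 1][j] - ii[i][j]
    return ii

def coverage_score(ii, bb):
    """S(bb, m) of Eq. (2) for a cell-aligned box bb = (i1, j1, i2, j2):
    the mean mask value inside bb, obtained with four table lookups."""
    i1, j1, i2, j2 = bb
    total = ii[i2][j2] - ii[i1][j2] - ii[i2][j1] + ii[i1][j1]
    return total / ((i2 - i1) * (j2 - j1))
```

The four table lookups per box are what make the exhaustive search over many candidate boxes cheap once the integral image has been computed.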
If we expand the above score over all five mask types, then the final score reads:\n\nS(bb) = \u2211_{h \u2208 halves} (S(bb(h), mh) \u2212 S(bb(\u00afh), mh))    (3)\n\nwhere halves = {full, bottom, top, left, right} index the full box and its four halves. For h denoting\none of the halves, \u00afh denotes the opposite half of h; e.g. the top half of a box should be well covered by the top\nmask and not at all by the bottom one. For h = full, we denote by \u00afh a rectangular region around bb\nwhose score penalizes full masks that extend outside bb. In the above summation, the score for\na box would be large if it is consistent with all five masks.\nWe use the score from Eq. (3) to exhaustively search in the set of possible bounding boxes. We\nconsider bounding boxes with mean dimension equal to [0.1, . . . , 0.9] of the mean image dimension\nand 10 different aspect ratios estimated by k-means clustering of the boxes of the objects in the\ntraining data. We slide each of the above 90 boxes using a stride of 5 pixels in the image. Note that\nthe score from Eq. (3) can be efficiently computed using 4 operations after the integral image of the\nmask m has been computed. The exact number of operations is 5(2 \u00d7 #pixels + 20 \u00d7 #boxes),\nwhere the first term measures the complexity of the integral mask computation while the second\naccounts for the box score computation.\nTo produce the final set of detections we perform two types of filtering. The first is by keeping boxes\nwith a strong score as defined by Eq. (2), e.g. larger than 0.5. We further prune them by applying a\nDNN classifier by [14] trained on the classes of interest and retaining the positively classified ones\nw.r.t. the class of the current detector. 
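The geometry behind Eq. (3), i.e. the full box, its four half-boxes, and the "opposite half" pairing, can be made explicit. A small sketch (illustrative Python; the (x1, y1, x2, y2) tuple convention is ours, not the paper's):

```python
# The "opposite half" pairing used by Eq. (3): a candidate half-box is
# rewarded for coverage by its own mask and penalized for coverage by
# the mask of the opposite half.
OPPOSITE = {"top": "bottom", "bottom": "top", "left": "right", "right": "left"}

def box_halves(bb):
    """Split bb = (x1, y1, x2, y2) into the full box and its four halves,
    the candidate boxes bb(h) that are scored against the five masks."""
    x1, y1, x2, y2 = bb
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return {
        "full": (x1, y1, x2, y2),
        "top": (x1, y1, x2, my),
        "bottom": (x1, my, x2, y2),
        "left": (x1, y1, mx, y2),
        "right": (mx, y1, x2, y2),
    }
```

A candidate box then scores highly only when all five coverage patterns agree, which is what makes the over-complete mask set robust to individual mask errors.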
Finally, we apply non-maximum suppression as in [9].\n\n5.3 Multi-scale Refinement of the DNN Localizer\n\nThe issue of insufficient resolution of the network output is addressed in two ways: (i) applying\nthe DNN localizer over several scales and a few large sub-windows; (ii) refinement of detections by\napplying the DNN localizer on the top inferred bounding boxes (see Fig. 2).\nUsing large windows at various scales, we produce several masks and merge them into higher resolution\nmasks, one for each scale. The range of the suitable scales depends on the resolution of the\nimage and the size of the receptive field of the localizer \u2013 we want the image to be covered by network\noutputs which operate at a higher resolution, while at the same time we want each object to fall\nwithin at least one window and the number of these windows to be small.\nTo achieve the above goals, we use three scales: the full image and two other scales such that the\nsize of the window at a given scale is half of the size of the window at the previous scale. We cover\nthe image at each scale with windows such that these windows have a small overlap \u2013 20% of their\narea. These windows are relatively small in number and cover the image at several scales. Most\nimportantly, the windows at the smallest scale allow localization at a higher resolution.\nAt inference time, we apply the DNN on all windows. Note that this is quite different from sliding\nwindow approaches because we need to evaluate only a small number of windows per image, usually\nless than 40. The generated object masks at each scale are merged by a maximum operation. This\ngives us three masks of the size of the image, each \u2018looking\u2019 at objects of different sizes. For each\nscale, we apply the bounding box inference from Sec. 5.2 to arrive at a set of detections. 
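The window layout and per-scale merging just described can be sketched as follows (illustrative Python; the paper specifies three halving scales and roughly 20% overlap but not the exact tiling, so the stride rule and the square-image simplification are our assumptions):

```python
def windows_for_scale(size, win, overlap=0.2):
    """Positions of win x win windows tiling a size x size image so that
    neighboring windows overlap by roughly `overlap` of their side."""
    stride = max(1, int(win * (1 - overlap)))
    xs = list(range(0, size - win + 1, stride))
    if xs[-1] + win < size:      # make sure the right/bottom border is covered
        xs.append(size - win)
    return [(x, y, x + win, y + win) for y in xs for x in xs]

def multi_scale_windows(size):
    """Three scales as in Sec. 5.3: the full image, then windows whose side
    is half, then a quarter, of the image side."""
    wins = []
    for win in (size, size // 2, size // 4):
        wins.extend(windows_for_scale(size, win))
    return wins

def merge_masks(masks):
    """Merge the masks produced at one scale by an elementwise maximum."""
    h, w = len(masks[0]), len(masks[0][0])
    return [[max(m[i][j] for m in masks) for j in range(w)] for i in range(h)]
```

Under these assumptions, a 400 x 400 image yields 35 windows across the three scales, in line with the "usually less than 40" figure quoted above.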
In our\nimplementation, we took the top 5 detections per scale, resulting in a total of 15 detections.\nTo further improve the localization, we go through a second stage of DNN regression called refinement.\nThe DNN localizer is applied on the windows defined by the initial detection stage \u2013 each of\nthe 15 bounding boxes is enlarged by a factor of 1.2, and the localizer is applied to the resulting crop. Applying the\nlocalizer at a higher resolution increases the precision of the detections significantly.\nThe complete algorithm is outlined in Algorithm 1.\n\nAlgorithm 1: Overall algorithm: multi-scale DNN-based localization and subsequent refinement.\nThe algorithm is applied for each object class separately.\nInput: input image x; networks DNNh, h \u2208 {full, bottom, top, left, right}, producing full and partial object box masks.\nOutput: set of detected object bounding boxes with confidence scores.\ndetections \u2190 \u2205\nscales \u2190 compute suitable scales for the image\nfor s \u2208 scales do\n  windows \u2190 generate windows for the given scale s\n  for w \u2208 windows do\n    for h \u2208 {full, bottom, top, left, right} do\n      mh_w \u2190 DNNh(w)\n    end\n  end\n  mh \u2190 merge the masks mh_w, w \u2208 windows\n  detections_s \u2190 obtain a set of bounding boxes with scores from mh as in Sec. 5.2\n  detections \u2190 detections \u222a detections_s\nend\nrefined \u2190 \u2205\nfor d \u2208 detections do\n  c \u2190 cropped image for the enlarged bounding box of d\n  for h \u2208 {full, bottom, top, left, right} do\n    mh \u2190 DNNh(c)\n  end\n  detection \u2190 infer the highest scoring bounding box from mh as in Sec. 5.2\n  refined \u2190 refined \u222a {detection}\nend\nreturn refined\n\n6 DNN Training\n\nOne of the compelling features of our network is its simplicity: the classifier is simply replaced by\na mask generation layer without any smoothness prior or convolutional structure. 
However, it needs\nto be trained with a huge amount of training data: objects of different sizes need to occur at almost\nevery location.\nFor training the mask generator, we generate several thousand samples from each image, divided\ninto 60% negative and 40% positive samples. A sample is considered to be negative if it does not\nintersect the bounding box of any object of interest. Positive samples are those covering at least 80%\nof the area of some of the object bounding boxes. The crops are sampled such that their width is\ndistributed uniformly between the prescribed minimum scale and the width of the whole image.\nWe use similar preparation steps to train the classifier used for the final pruning of our detections.\nAgain, we sample several thousand samples from each image: 60% negative and 40% positive\nsamples. The negative samples are those whose bounding boxes have less than 0.2 Jaccard similarity\nwith any of the ground truth object boxes. The positive samples must have at least 0.6 similarity with\nsome of the object bounding boxes and are labeled by the class of the object with the most similar\nbounding box to the crop. Adding the extra negative class acts as a regularizer and improves the\nquality of the filters. In both cases, the total number of samples is chosen to be ten million for each\nclass.\nSince training for localization is harder than for classification, it is important to start with the weights\nof a model with high-quality low-level filters. To achieve this, we first train the network for classification\nand reuse the weights of all layers but the classifier for localization. For localization, we\nfine-tuned the whole network, including the convolutional layers.\nThe networks were trained by stochastic gradient descent, using ADAGRAD [6] to estimate the learning rate\nof the layers automatically.\n\nclass | aero | bicycle | bird | boat | bottle | bus | car | cat | chair | cow\nDetectorNet1 | .292 | .352 | .194 | .167 | .037 | .532 | .502 | .272 | .102 | .348\nSliding windows1 | .213 | .190 | .120 | .068 | .058 | .294 | .237 | .101 | .059 | .131\n3-layer model [19] | .294 | .558 | .143 | .094 | .286 | .440 | .513 | .213 | .200 | .193\nFelz. et al. [9] | .328 | .568 | .168 | .025 | .285 | .397 | .516 | .213 | .179 | .185\nGirshick et al. [11] | .324 | .577 | .107 | .157 | .253 | .513 | .542 | .179 | .210 | .240\n\nclass | table | dog | horse | m-bike | person | plant | sheep | sofa | train | tv\nDetectorNet1 | .302 | .282 | .466 | .417 | .262 | .103 | .328 | .268 | .398 | .470\nSliding windows1 | .110 | .134 | .243 | .220 | .173 | .070 | .118 | .166 | .240 | .119\n3-layer model [19] | .252 | .125 | .504 | .384 | .366 | .151 | .197 | .251 | .368 | .393\nFelz. et al. [9] | .259 | .088 | .492 | .412 | .368 | .146 | .162 | .244 | .392 | .391\nGirshick et al. [11] | .257 | .116 | .475 | .556 | .435 | .145 | .226 | .342 | .442 | .413\n\nTable 1: Average precision on the Pascal VOC 2007 test set.\n\nFigure 3: For each image, we show two heat maps on the right: the first one corresponds to the\noutput of DNNfull, while the second one encodes the four partial masks in terms of the strength of\nthe colors red, green, blue and yellow. In addition, we visualize the estimated object bounding box.\nAll examples are correct detections, with the exception of the examples in the last row.\n\n7 Experiments\n\nDataset: We evaluate the performance of the proposed approach on the test set of the Pascal Visual\nObject Classes (VOC) 2007 challenge [7]. The dataset contains approx. 5000 test images over 20 classes.\nSince our approach has a large number of parameters, we train on the VOC2012 training and validation\nset, which has approx. 11K images. At test time, an algorithm produces for an image a set\nof detections, defined by bounding boxes and their class labels. We use precision-recall curves and\naverage precision (AP) per class to measure the performance of the algorithm.\nEvaluation: The complete evaluation on the VOC2007 test set is given in Table 1. 
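As an aside, the crop-labeling rule used to build the classifier training set in Sec. 6 (Jaccard similarity below 0.2 with every ground truth box means negative; at least 0.6 with some box means positive, labeled by the most similar box) can be sketched as follows (illustrative Python; treating intermediate-overlap crops as discarded, returning None, is our assumption):

```python
def jaccard(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_crop(crop, gt_boxes, neg_thr=0.2, pos_thr=0.6):
    """Label a training crop: 'negative' if its best Jaccard similarity to
    all ground truth boxes is below neg_thr; the class of the most similar
    box if that similarity reaches pos_thr; otherwise None (discarded)."""
    sims = [(jaccard(crop, bb), cls) for bb, cls in gt_boxes]
    best_sim, best_cls = max(sims) if sims else (0.0, None)
    if best_sim < neg_thr:
        return "negative"
    if best_sim >= pos_thr:
        return best_cls
    return None
```

The explicit negative class is what the paper credits with regularizing the classifier and improving the quality of its filters.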
We compare our\napproach, named DetectorNet, to three related approaches. The first is a sliding window version\nof a DNN classifier by [14]. After training this network as a 21-way classifier (VOC classes and\nbackground), we generate bounding boxes with 8 different aspect ratios and at 10 different scales,\nplaced 5 pixels apart. The smallest scale is 1/10-th of the image size, while the largest covers the\nwhole image. This results in approximately 150,000 boxes per image. Each box is mapped affinely\nto the 225 \u00d7 225 receptive field. The detection score is computed by the softmax classifier. We\nreduce the number of the boxes by non-maximum suppression, using a Jaccard similarity of at least\n0.5 to discard boxes.\n\n1Trained on VOC2012 training and validation sets.\n\nFigure 4: Precision-recall curves of DetectorNet after the first stage and after the refinement.\n\nAfter the initial training, we performed two rounds of hard negative mining on\nthe training set. This added two million examples to our original training set and cut down the\nrate of false positives.\nThe second approach is the 3-layer compositional model by [19], which can be considered a deep\narchitecture. As a co-winner of VOC2011, this approach has shown excellent performance. Finally,\nwe compare against the DPM by [9] and [11].\nAlthough our comparison is somewhat unfair, as we trained on the larger VOC2012 training set, we\nshow state-of-the-art performance compared to most of these models: we outperform on 8 classes and perform\non par on 1 other. Note that it might be possible to tune the sliding window to perform on par with\nDetectorNet; however, the sheer amount of network evaluations makes that approach infeasible, while\nDetectorNet requires only (#windows \u00d7 #mask types) \u223c 120 crops per class to be evaluated. 
On\na 12-core machine, our implementation took about 5\u20136 seconds per image for each class.\nContrary to the widely cited DPM approach by [9], DetectorNet excels at deformable objects such\nas bird, cat, sheep, and dog. This shows that it handles less rigid objects better while at the same time working\nwell on rigid objects such as car, bus, etc.\nWe show examples of the detections in Fig. 3, where both the detected box as well as all five generated\nmasks are visualized. It can be seen that DetectorNet is capable of accurately finding\nnot only large but also small objects. The generated masks are well localized and have almost no\nresponse outside the object. Such high-quality detector responses are hard to achieve and in this\ncase are possible because of the expressive power of the DNN and its natural way of incorporating\ncontext.\nThe common misdetections are due to similar-looking objects (left object in the last row of Fig. 3)\nor imprecise localization (right object in the last row). The latter problem is due to the ambiguous\ndefinition of object extent in the training data \u2013 in some images only the head of the bird is visible,\nwhile in others the full body. In many cases we might observe detections of both the body and the head\nif they are both present in the same image.\nFinally, the refinement step contributes drastically to the quality of the detection. This can be seen in\nFig. 4, where we show the precision vs. recall of DetectorNet after the first stage of detection and after\nrefinement. A noticeable improvement can be observed, mainly due to the fact that better localized\ntrue positives have their scores boosted.\n\n8 Conclusion\n\nIn this work we leverage the expressivity of DNNs for object detection. We show that the simple\nformulation of detection as DNN-based object mask regression can yield strong results when applied\nusing a multi-scale coarse-to-fine procedure. 
These results come at some computational cost at\ntraining time \u2013 one needs to train a network per object type and mask type. As future work, we aim\nto reduce the cost by using a single network to detect objects of different classes and thus expand\nto a larger number of classes.\n\nReferences\n\n[1] Narendra Ahuja and Sinisa Todorovic. Learning the taxonomy and models of categories present in arbitrary\nimages. In International Conference on Computer Vision, 2007.\n\n[2] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning,\n2(1):1\u2013127, 2009.\n\n[3] Dan Ciresan, Alessandro Giusti, Juergen Schmidhuber, et al. Deep neural networks segment neuronal\nmembranes in electron microscopy images. In Advances in Neural Information Processing Systems 25,\n2012.\n\n[4] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision\nand Pattern Recognition, 2005.\n\n[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical\nImage Database. In Computer Vision and Pattern Recognition, 2009.\n\n[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and\nstochastic optimization. In Conference on Learning Theory, 2010.\n\n[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object\nClasses (VOC) challenge. International Journal of Computer Vision, 88(2):303\u2013338, 2010.\n\n[8] Cl\u00e9ment Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features\nfor scene labeling. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915\u20131929,\n2013.\n\n[9] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with\ndiscriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence,\n32(9):1627\u20131645, 2010.\n\n[10] Sanja Fidler and Ale\u0161 Leonardis. Towards scalable representations of object categories: Learning a hierarchy\nof parts. In Computer Vision and Pattern Recognition, 2007.\n\n[11] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models,\nrelease 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.\n\n[12] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks.\nScience, 313(5786):504\u2013507, 2006.\n\n[13] Iasonas Kokkinos and Alan Yuille. Inference and learning with hierarchical shape models. International\nJournal of Computer Vision, 93(2):201\u2013225, 2011.\n\n[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional\nneural networks. In Advances in Neural Information Processing Systems 25, 2012.\n\n[15] Quoc V. Le, Marc\u2019Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean,\nand Andrew Y. Ng. Building high-level features using large scale unsupervised learning. In International\nConference on Machine Learning, 2012.\n\n[16] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The\nHandbook of Brain Theory and Neural Networks, 1995.\n\n[17] Jorge S\u00e1nchez and Florent Perronnin. High-dimensional signature compression for large-scale image\nclassification. In Computer Vision and Pattern Recognition, 2011.\n\n[18] Hannes Schulz and Sven Behnke. 
Object-class segmentation using deep convolutional neural networks.\nIn Proceedings of the DAGM Workshop on New Challenges in Neural Computation, 2011.\n\n[19] Long Zhu, Yuanhao Chen, Alan Yuille, and William Freeman. Latent hierarchical structural learning for\nobject detection. In Computer Vision and Pattern Recognition, 2010.\n\n[20] Song Chun Zhu and David Mumford. A stochastic grammar of images. Foundations and Trends in Computer\nGraphics and Vision, 2(4):259\u2013362, 2007.\n", "award": [], "sourceid": 1210, "authors": [{"given_name": "Christian", "family_name": "Szegedy", "institution": "Google Research"}, {"given_name": "Alexander", "family_name": "Toshev", "institution": "Google Research"}, {"given_name": "Dumitru", "family_name": "Erhan", "institution": "Google Research"}]}