{"title": "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 91, "page_last": 99, "abstract": "State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. Code is available at https://github.com/ShaoqingRen/faster_rcnn.", "full_text": "Faster R-CNN: Towards Real-Time Object Detection\n\nwith Region Proposal Networks\n\nShaoqing Ren\u2217 Kaiming He Ross Girshick\n\nJian Sun\n\n{v-shren, kahe, rbg, jiansun}@microsoft.com\n\nMicrosoft Research\n\nAbstract\n\nState-of-the-art object detection networks depend on region proposal algorithms\nto hypothesize object locations. 
Advances like SPPnet [7] and Fast R-CNN [5]\nhave reduced the running time of these detection networks, exposing region pro-\nposal computation as a bottleneck. In this work, we introduce a Region Pro-\nposal Network (RPN) that shares full-image convolutional features with the de-\ntection network, thus enabling nearly cost-free region proposals. An RPN is a\nfully-convolutional network that simultaneously predicts object bounds and ob-\njectness scores at each position. RPNs are trained end-to-end to generate high-\nquality region proposals, which are used by Fast R-CNN for detection. With a\nsimple alternating optimization, RPN and Fast R-CNN can be trained to share\nconvolutional features. For the very deep VGG-16 model [19], our detection\nsystem has a frame rate of 5fps (including all steps) on a GPU, while achieving\nstate-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP)\nand 2012 (70.4% mAP) using 300 proposals per image. Code is available at\nhttps://github.com/ShaoqingRen/faster_rcnn.\n\n1 Introduction\n\nRecent advances in object detection are driven by the success of region proposal methods (e.g., [22])\nand region-based convolutional neural networks (R-CNNs) [6]. Although region-based CNNs were\ncomputationally expensive as originally developed in [6], their cost has been drastically reduced\nthanks to sharing convolutions across proposals [7, 5]. The latest incarnation, Fast R-CNN [5],\nachieves near real-time rates using very deep networks [19], when ignoring the time spent on region\nproposals. Now, proposals are the computational bottleneck in state-of-the-art detection systems.\nRegion proposal methods typically rely on inexpensive features and economical inference schemes.\nSelective Search (SS) [22], one of the most popular methods, greedily merges superpixels based\non engineered low-level features. 
Yet when compared to ef\ufb01cient detection networks [5], Selective\nSearch is an order of magnitude slower, at 2s per image in a CPU implementation. EdgeBoxes\n[24] currently provides the best tradeoff between proposal quality and speed, at 0.2s per image.\nNevertheless, the region proposal step still consumes as much running time as the detection network.\nOne may note that fast region-based CNNs take advantage of GPUs, while the region proposal meth-\nods used in research are implemented on the CPU, making such runtime comparisons inequitable.\nAn obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be\nan effective engineering solution, but re-implementation ignores the down-stream detection network\nand therefore misses important opportunities for sharing computation.\n\n\u2217Shaoqing Ren is with the University of Science and Technology of China. This work was done when he\nwas an intern at Microsoft Research.\n\nIn this paper, we show that an algorithmic change\u2014computing proposals with a deep net\u2014leads\nto an elegant and effective solution, where proposal computation is nearly cost-free given the de-\ntection network\u2019s computation. To this end, we introduce novel Region Proposal Networks (RPNs)\nthat share convolutional layers with state-of-the-art object detection networks [7, 5]. By sharing\nconvolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).\nOur observation is that the convolutional (conv) feature maps used by region-based detectors, like\nFast R-CNN, can also be used for generating region proposals. 
On top of these conv features, we\nconstruct RPNs by adding two additional conv layers: one that encodes each conv map position\ninto a short (e.g., 256-d) feature vector and a second that, at each conv map position, outputs an\nobjectness score and regressed bounds for k region proposals relative to various scales and aspect\nratios at that location (k = 9 is a typical value).\nOur RPNs are thus a kind of fully-convolutional network (FCN) [14] and they can be trained end-to-\nend speci\ufb01cally for the task for generating detection proposals. To unify RPNs with Fast R-CNN [5]\nobject detection networks, we propose a simple training scheme that alternates between \ufb01ne-tuning\nfor the region proposal task and then \ufb01ne-tuning for object detection, while keeping the proposals\n\ufb01xed. This scheme converges quickly and produces a uni\ufb01ed network with conv features that are\nshared between both tasks.\nWe evaluate our method on the PASCAL VOC detection benchmarks [4], where RPNs with Fast\nR-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast\nR-CNNs. Meanwhile, our method waives nearly all computational burdens of SS at test-time\u2014the\neffective running time for proposals is just 10 milliseconds. Using the expensive very deep models\nof [19], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and\nthus is a practical object detection system in terms of both speed and accuracy (73.2% mAP on\nPASCAL VOC 2007 and 70.4% mAP on 2012). Code is available at https://github.com/\nShaoqingRen/faster_rcnn.\n\n2 Related Work\n\nSeveral recent papers have proposed ways of using deep networks for locating class-speci\ufb01c or class-\nagnostic bounding boxes [21, 18, 3, 20]. In the OverFeat method [18], a fully-connected (fc) layer\nis trained to predict the box coordinates for the localization task that assumes a single object. 
The\nfc layer is then turned into a conv layer for detecting multiple class-speci\ufb01c objects. The Multi-\nBox methods [3, 20] generate region proposals from a network whose last fc layer simultaneously\npredicts multiple (e.g., 800) boxes, which are used for R-CNN [6] object detection. Their proposal\nnetwork is applied on a single image or multiple large image crops (e.g., 224\u00d7224) [20]. We discuss\nOverFeat and MultiBox in more depth later in context with our method.\nShared computation of convolutions [18, 7, 2, 5] has been attracting increasing attention for ef\ufb01-\ncient, yet accurate, visual recognition. The OverFeat paper [18] computes conv features from an\nimage pyramid for classi\ufb01cation, localization, and detection. Adaptively-sized pooling (SPP) [7] on\nshared conv feature maps is proposed for ef\ufb01cient region-based object detection [7, 16] and semantic\nsegmentation [2]. Fast R-CNN [5] enables end-to-end detector training on shared conv features and\nshows compelling accuracy and speed.\n\n3 Region Proposal Networks\n\nA Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of\nrectangular object proposals, each with an objectness score.1 We model this process with a fully-\nconvolutional network [14], which we describe in this section. Because our ultimate goal is to share\ncomputation with a Fast R-CNN object detection network [5], we assume that both nets share a\ncommon set of conv layers. In our experiments, we investigate the Zeiler and Fergus model [23]\n(ZF), which has 5 shareable conv layers and the Simonyan and Zisserman model [19] (VGG), which\nhas 13 shareable conv layers.\nTo generate region proposals, we slide a small network over the conv feature map output by the last\nshared conv layer. 
This network is fully connected to an n \u00d7 n spatial window of the input conv\nfeature map. Each sliding window is mapped to a lower-dimensional vector (256-d for ZF and 512-d\nfor VGG). This vector is fed into two sibling fully-connected layers\u2014a box-regression layer (reg)\nand a box-classi\ufb01cation layer (cls). We use n = 3 in this paper, noting that the effective receptive\n\ufb01eld on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-\nnetwork is illustrated at a single position in Fig. 1 (left). Note that because the mini-network operates\nin a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This\narchitecture is naturally implemented with an n \u00d7 n conv layer followed by two sibling 1 \u00d7 1 conv\nlayers (for reg and cls, respectively). ReLUs [15] are applied to the output of the n \u00d7 n conv layer.\n\n1\u201cRegion\u201d is a generic term and in this paper we only consider rectangular regions, as is common for many\nmethods (e.g., [20, 22, 24]). \u201cObjectness\u201d measures membership to a set of object classes vs. background.\n\nFigure 1: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals\non PASCAL VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.\n\nTranslation-Invariant Anchors\nAt each sliding-window location, we simultaneously predict k region proposals, so the reg layer\nhas 4k outputs encoding the coordinates of k boxes. The cls layer outputs 2k scores that estimate\nprobability of object / not-object for each proposal.2 The k proposals are parameterized relative to\nk reference boxes, called anchors. Each anchor is centered at the sliding window in question, and is\nassociated with a scale and aspect ratio. We use 3 scales and 3 aspect ratios, yielding k = 9 anchors\nat each sliding position. 
For a conv feature map of a size W \u00d7H (typically \u223c2,400), there are W Hk\nanchors in total. An important property of our approach is that it is translation invariant, both in\nterms of the anchors and the functions that compute proposals relative to the anchors.\nAs a comparison, the MultiBox method [20] uses k-means to generate 800 anchors, which are not\ntranslation invariant. If one translates an object in an image, the proposal should translate and the\nsame function should be able to predict the proposal in either location. Moreover, because the\nMultiBox anchors are not translation invariant, it requires a (4+1)\u00d7800-dimensional output layer,\nwhereas our method requires a (4+2)\u00d79-dimensional output layer. Our proposal layers have an order\nof magnitude fewer parameters (27 million for MultiBox using GoogLeNet [20] vs. 2.4 million for\nRPN using VGG-16), and thus have less risk of over\ufb01tting on small datasets, like PASCAL VOC.\n\nA Loss Function for Learning Region Proposals\nFor training RPNs, we assign a binary class label (of being an object or not) to each anchor. We\nassign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-\nover-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher\nthan 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels\nto multiple anchors. We assign a negative label to a non-positive anchor if its IoU ratio is lower than\n0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the\ntraining objective.\nWith these de\ufb01nitions, we minimize an objective function following the multi-task loss in Fast R-\nCNN [5]. 
Our loss function for an image is de\ufb01ned as:\n\n$$L(\\{p_i\\}, \\{t_i\\}) = \\frac{1}{N_{cls}} \\sum_i L_{cls}(p_i, p_i^*) + \\lambda \\frac{1}{N_{reg}} \\sum_i p_i^* L_{reg}(t_i, t_i^*). \\qquad (1)$$\n\nHere, $i$ is the index of an anchor in a mini-batch and $p_i$ is the predicted probability of anchor $i$ being\nan object. The ground-truth label $p_i^*$ is 1 if the anchor is positive, and is 0 if the anchor is negative. $t_i$\nis a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is that\nof the ground-truth box associated with a positive anchor. The classi\ufb01cation loss $L_{cls}$ is log loss over\ntwo classes (object vs. not object). For the regression loss, we use $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$ where\n$R$ is the robust loss function (smooth L1) de\ufb01ned in [5]. The term $p_i^* L_{reg}$ means the regression loss\nis activated only for positive anchors ($p_i^* = 1$) and is disabled otherwise ($p_i^* = 0$). The outputs of\nthe cls and reg layers consist of $\\{p_i\\}$ and $\\{t_i\\}$ respectively. The two terms are normalized with $N_{cls}$\nand $N_{reg}$, and a balancing weight $\\lambda$.3\n\n2For simplicity we implement the cls layer as a two-class softmax layer. Alternatively, one may use logistic\nregression to produce k scores.\n\nFor regression, we adopt the parameterizations of the 4 coordinates following [6]:\n\n$$t_x = (x - x_a)/w_a, \\quad t_y = (y - y_a)/h_a, \\quad t_w = \\log(w/w_a), \\quad t_h = \\log(h/h_a),$$\n$$t_x^* = (x^* - x_a)/w_a, \\quad t_y^* = (y^* - y_a)/h_a, \\quad t_w^* = \\log(w^*/w_a), \\quad t_h^* = \\log(h^*/h_a),$$\n\nwhere x, y, w, and h denote the two coordinates of the box center, width, and height. 
Variables\n$x$, $x_a$, and $x^*$ are for the predicted box, anchor box, and ground-truth box respectively (likewise\nfor $y$, $w$, $h$). This can be thought of as bounding-box regression from an anchor box to a nearby\nground-truth box.\nNevertheless, our method achieves bounding-box regression in a different manner from previous\nfeature-map-based methods [7, 5]. In [7, 5], bounding-box regression is performed on features\npooled from arbitrarily sized regions, and the regression weights are shared by all region sizes. In\nour formulation, the features used for regression are of the same spatial size (n \u00d7 n) on the feature\nmaps. To account for varying sizes, a set of k bounding-box regressors is learned. Each regressor\nis responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such,\nit is still possible to predict boxes of various sizes even though the features are of a \ufb01xed size/scale.\n\nOptimization\nThe RPN, which is naturally implemented as a fully-convolutional network [14], can be trained\nend-to-end by back-propagation and stochastic gradient descent (SGD) [12]. We follow the \u201cimage-\ncentric\u201d sampling strategy from [5] to train this network. Each mini-batch arises from a single image\nthat contains many positive and negative anchors. It is possible to optimize for the loss functions of\nall anchors, but this will bias towards negative samples as they dominate. Instead, we randomly\nsample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled\npositive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples\nin an image, we pad the mini-batch with negative ones.\nWe randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution\nwith standard deviation 0.01. 
All other layers (i.e., the shared conv layers) are initialized by pre-\ntraining a model for ImageNet classi\ufb01cation [17], as is standard practice [6]. We tune all layers of\nthe ZF net, and conv3_1 and up for the VGG net to conserve memory [5]. We use a learning rate\nof 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL dataset.\nWe also use a momentum of 0.9 and a weight decay of 0.0005 [11]. Our implementation uses Caffe\n[10].\n\nSharing Convolutional Features for Region Proposal and Object Detection\nThus far we have described how to train a network for region proposal generation, without con-\nsidering the region-based object detection CNN that will utilize these proposals. For the detection\nnetwork, we adopt Fast R-CNN [5]4 and now describe an algorithm that learns conv layers that are\nshared between the RPN and Fast R-CNN.\n\n3In our early implementation (as also in the released code), \u03bb was set as 10, and the cls term in Eqn.(1) was\nnormalized by the mini-batch size (i.e., Ncls = 256) and the reg term was normalized by the number of anchor\nlocations (i.e., Nreg \u223c 2,400). Both cls and reg terms are roughly equally weighted in this way.\n\n4https://github.com/rbgirshick/fast-rcnn\n\nBoth RPN and Fast R-CNN, trained independently, will modify their conv layers in different ways.\nWe therefore need to develop a technique that allows for sharing conv layers between the two net-\nworks, rather than learning two separate networks. Note that this is not as easy as simply de\ufb01ning\na single network that includes both RPN and Fast R-CNN, and then optimizing it jointly with back-\npropagation. The reason is that Fast R-CNN training depends on \ufb01xed object proposals and it is\nnot clear a priori if learning Fast R-CNN while simultaneously changing the proposal mechanism\nwill converge. 
While this joint optimizing is an interesting question for future work, we develop a\npragmatic 4-step training algorithm to learn shared features via alternating optimization.\nIn the \ufb01rst step, we train the RPN as described above. This network is initialized with an ImageNet-\npre-trained model and \ufb01ne-tuned end-to-end for the region proposal task. In the second step, we\ntrain a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN.\nThis detection network is also initialized by the ImageNet-pre-trained model. At this point the two\nnetworks do not share conv layers. In the third step, we use the detector network to initialize RPN\ntraining, but we \ufb01x the shared conv layers and only \ufb01ne-tune the layers unique to RPN. Now the two\nnetworks share conv layers. Finally, keeping the shared conv layers \ufb01xed, we \ufb01ne-tune the fc layers\nof the Fast R-CNN. As such, both networks share the same conv layers and form a uni\ufb01ed network.\n\nImplementation Details\nWe train and test both region proposal and object detection networks on single-scale images [7,\n5]. We re-scale the images such that their shorter side is s = 600 pixels [5]. Multi-scale feature\nextraction may improve accuracy but does not exhibit a good speed-accuracy trade-off [5]. We also\nnote that for ZF and VGG nets, the total stride on the last conv layer is 16 pixels on the re-scaled\nimage, and thus is \u223c10 pixels on a typical PASCAL image (\u223c500\u00d7375). Even such a large stride\nprovides good results, though accuracy may be further improved with a smaller stride.\nFor anchors, we use 3 scales with box areas of 128\u00b2, 256\u00b2, and 512\u00b2 pixels, and 3 aspect ratios of\n1:1, 1:2, and 2:1. We note that our algorithm allows the use of anchor boxes that are larger than the\nunderlying receptive \ufb01eld when predicting large proposals. 
Such predictions are not impossible\u2014\none may still roughly infer the extent of an object if only the middle of the object is visible. With\nthis design, our solution does not need multi-scale features or multi-scale sliding windows to predict\nlarge regions, saving considerable running time. Fig. 1 (right) shows the capability of our method\nfor a wide range of scales and aspect ratios. The table below shows the learned average proposal\nsize for each anchor using the ZF net (numbers for s = 600).\n\nanchor | proposal\n128\u00b2, 2:1 | 188\u00d7111\n128\u00b2, 1:1 | 113\u00d7114\n128\u00b2, 1:2 | 70\u00d792\n256\u00b2, 2:1 | 416\u00d7229\n256\u00b2, 1:1 | 261\u00d7284\n256\u00b2, 1:2 | 174\u00d7332\n512\u00b2, 2:1 | 768\u00d7437\n512\u00b2, 1:1 | 499\u00d7501\n512\u00b2, 1:2 | 355\u00d7715\n\nThe anchor boxes that cross image boundaries need to be handled with care. During training, we\nignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 \u00d7 600\nimage, there will be roughly 20k (\u2248 60 \u00d7 40 \u00d7 9) anchors in total. With the cross-boundary anchors\nignored, there are about 6k anchors per image for training. If the boundary-crossing outliers are not\nignored in training, they introduce large, dif\ufb01cult to correct error terms in the objective, and training\ndoes not converge. During testing, however, we still apply the fully-convolutional RPN to the entire\nimage. This may generate cross-boundary proposal boxes, which we clip to the image boundary.\nSome RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-\nmaximum suppression (NMS) on the proposal regions based on their cls scores. We \ufb01x the IoU\nthreshold for NMS at 0.7, which leaves us about 2k proposal regions per image. As we will show,\nNMS does not harm the ultimate detection accuracy, but substantially reduces the number of pro-\nposals. After NMS, we use the top-N ranked proposal regions for detection. 
In the following, we\ntrain Fast R-CNN using 2k RPN proposals, but evaluate different numbers of proposals at test-time.\n\n4 Experiments\nWe comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [4].\nThis dataset consists of about 5k trainval images and 5k test images over 20 object categories. We\nalso provide results in the PASCAL VOC 2012 benchmark for a few models. For the ImageNet\npre-trained network, we use the \u201cfast\u201d version of ZF net [23] that has 5 conv layers and 3 fc layers,\nand the public VGG-16 model5 [19] that has 13 conv layers and 3 fc layers. We primarily evalu-\nate detection mean Average Precision (mAP), because this is the actual metric for object detection\n(rather than focusing on object proposal proxy metrics).\n\n5www.robots.ox.ac.uk/~vgg/research/very_deep/\n\nTable 1: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The\ndetectors are Fast R-CNN with ZF, but using various proposal methods for training and testing.\n\ntrain-time method | train-time # boxes | test-time method | test-time # proposals | mAP (%)\nSS | 2k | SS | 2k | 58.7\nEB | 2k | EB | 2k | 58.6\nRPN+ZF, shared | 2k | RPN+ZF, shared | 300 | 59.9\n(ablation experiments follow below)\nRPN+ZF, unshared | 2k | RPN+ZF, unshared | 300 | 58.7\nSS | 2k | RPN+ZF | 100 | 55.1\nSS | 2k | RPN+ZF | 300 | 56.8\nSS | 2k | RPN+ZF | 1k | 56.3\nSS | 2k | RPN+ZF (no NMS) | 6k | 55.2\nSS | 2k | RPN+ZF (no cls) | 100 | 44.6\nSS | 2k | RPN+ZF (no cls) | 300 | 51.4\nSS | 2k | RPN+ZF (no cls) | 1k | 55.8\nSS | 2k | RPN+ZF (no reg) | 300 | 52.1\nSS | 2k | RPN+ZF (no reg) | 1k | 51.3\nSS | 2k | RPN+VGG | 300 | 59.2\n\nTable 1 (top) shows Fast R-CNN results when trained and tested using various region proposal\nmethods. These results use the ZF net. For Selective Search (SS) [22], we generate about 2k SS\nproposals by the \u201cfast\u201d mode. 
For EdgeBoxes (EB) [24], we generate the proposals by the default\nEB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6%. RPN with\nFast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals6.\nUsing RPN yields a much faster detection system than using either SS or EB because of shared conv\ncomputations; the fewer proposals also reduce the region-wise fc cost. Next, we consider several\nablations of RPN and then show that proposal quality improves when using the very deep network.\n\nAblation Experiments. To investigate the behavior of RPNs as a proposal method, we conducted\nseveral ablation studies. First, we show the effect of sharing conv layers between the RPN and Fast\nR-CNN detection network. To do this, we stop after the second step in the 4-step training process.\nUsing separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 1). We\nobserve that this is because in the third step when the detector-tuned features are used to \ufb01ne-tune\nthe RPN, the proposal quality is improved.\nNext, we disentangle the RPN\u2019s in\ufb02uence on training the Fast R-CNN detection network. For this\npurpose, we train a Fast R-CNN model by using the 2k SS proposals and ZF net. We \ufb01x this detector\nand evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation\nexperiments, the RPN does not share features with the detector.\nReplacing SS with 300 RPN proposals at test-time leads to an mAP of 56.8%. The loss in mAP\nis because of the inconsistency between the training/testing proposals. This result serves as the\nbaseline for the following comparisons.\nSomewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-\nranked 100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. 
On\nthe other extreme, using the top-ranked 6k RPN proposals (without NMS) has a comparable mAP\n(55.2%), suggesting NMS does not harm the detection mAP and may reduce false alarms.\nNext, we separately investigate the roles of RPN\u2019s cls and reg outputs by turning off either of them\nat test-time. When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly\nsample N proposals from the unscored regions. The mAP is nearly unchanged with N = 1k\n(55.8%), but degrades considerably to 44.6% when N = 100. This shows that the cls scores account\nfor the accuracy of the highest ranked proposals.\nOn the other hand, when the reg layer is removed at test-time (so the proposals become anchor\nboxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to\nregressed positions. The anchor boxes alone are not suf\ufb01cient for accurate detection.\n\n6For RPN, the number of proposals (e.g., 300) is the maximum number for an image. RPN may produce\nfewer proposals after NMS, and thus the average number of proposals is smaller.\n\nTable 2: Detection results on PASCAL VOC 2007 test set. The detector is Fast R-CNN and VGG-\n16. Training data: \u201c07\u201d: VOC 2007 trainval, \u201c07+12\u201d: union set of VOC 2007 trainval and VOC\n2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2k. \u2020: this was reported in [5];\nusing the repository provided by this paper, this number is higher (68.0\u00b10.3 in six runs).\n\nmethod | # proposals | data | mAP (%) | time (ms)\nSS | 2k | 07 | 66.9\u2020 | 1830\nSS | 2k | 07+12 | 70.0 | 1830\nRPN+VGG, unshared | 300 | 07 | 68.5 | 342\nRPN+VGG, shared | 300 | 07 | 69.9 | 198\nRPN+VGG, shared | 300 | 07+12 | 73.2 | 198\n\nTable 3: Detection results on PASCAL VOC 2012 test set. The detector is Fast R-CNN and VGG-\n16. Training data: \u201c07\u201d: VOC 2007 trainval, \u201c07++12\u201d: union set of VOC 2007 trainval+test\nand VOC 2012 trainval. 
For RPN, the train-time proposals for Fast R-CNN are 2k. \u2020: http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html. \u2021: http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html\n\nmethod | # proposals | data | mAP (%)\nSS | 2k | 12 | 65.7\nSS | 2k | 07++12 | 68.4\nRPN+VGG, shared\u2020 | 300 | 12 | 67.0\nRPN+VGG, shared\u2021 | 300 | 07++12 | 70.4\n\nTable 4: Timing (ms) on a K40 GPU, except SS proposal is evaluated in a CPU. \u201cRegion-wise\u201d\nincludes NMS, pooling, fc, and softmax. See our released code for the pro\ufb01ling of running time.\n\nmodel | system | conv (ms) | proposal (ms) | region-wise (ms) | total (ms) | rate\nVGG | SS + Fast R-CNN | 146 | 1510 | 174 | 1830 | 0.5 fps\nVGG | RPN + Fast R-CNN | 141 | 10 | 47 | 198 | 5 fps\nZF | RPN + Fast R-CNN | 31 | 3 | 25 | 59 | 17 fps\n\nWe also evaluate the effects of more powerful networks on the proposal quality of RPN alone.\nWe use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves\nfrom 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it\nsuggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because proposals of\nRPN+ZF are competitive with SS (both are 58.7% when consistently used for training and testing),\nwe may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.\n\nDetection Accuracy and Running Time of VGG-16. Table 2 shows the results of VGG-16 for both\nproposal and detection. Using RPN+VGG, the Fast R-CNN result is 68.5% for unshared features,\nslightly higher than the SS baseline. As shown above, this is because the proposals generated by\nRPN+VGG are more accurate than SS. Unlike SS that is pre-de\ufb01ned, the RPN is actively trained\nand bene\ufb01ts from better networks. For the feature-shared variant, the result is 69.9%\u2014better than\nthe strong SS baseline, yet with nearly cost-free proposals. 
We further train the RPN and detection\nnetwork on the union set of PASCAL VOC 2007 trainval and 2012 trainval, following [5]. The mAP\nis 73.2%. On the PASCAL VOC 2012 test set (Table 3), our method has an mAP of 70.4% trained\non the union set of VOC 2007 trainval+test and VOC 2012 trainval, following [5].\nIn Table 4 we summarize the running time of the entire object detection system. SS takes 1-2\nseconds depending on content (on average 1.51s), and Fast R-CNN with VGG-16 takes 320ms on\n2k SS proposals (or 223ms if using SVD on fc layers [5]). Our system with VGG-16 takes in total\n198ms for both proposal and detection. With the conv features shared, the RPN alone only takes\n10ms computing the additional layers. Our region-wise computation is also low, thanks to fewer\nproposals (300). Our system has a frame-rate of 17 fps with the ZF net.\n\nAnalysis of Recall-to-IoU. Next we compute the recall of proposals at different IoU ratios with\nground-truth boxes. It is noteworthy that the Recall-to-IoU metric is just loosely [9, 8, 1] related to\nthe ultimate detection accuracy. It is more appropriate to use this metric to diagnose the proposal\nmethod than to evaluate it.\n\nFigure 2: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set.\n\nTable 5: One-Stage Detection vs. Two-Stage Proposal + Detection. Detection results are on the\nPASCAL VOC 2007 test set using the ZF model and Fast R-CNN. RPN uses unshared features.\n\nsystem | regions (type) | regions (count) | detector | mAP (%)\nTwo-Stage | RPN + ZF, unshared | 300 | Fast R-CNN + ZF, 1 scale | 58.7\nOne-Stage | dense, 3 scales, 3 asp. ratios | 20k | Fast R-CNN + ZF, 1 scale | 53.8\nOne-Stage | dense, 3 scales, 3 asp. ratios | 20k | Fast R-CNN + ZF, 5 scales | 53.9\n\nIn Fig. 2, we show the results of using 300, 1k, and 2k proposals. 
We compare with SS and EB, and\nthe N proposals are the top-N ranked ones based on the con\ufb01dence generated by these methods.\nThe plots show that the RPN method behaves gracefully when the number of proposals drops from\n2k to 300. This explains why the RPN has a good ultimate detection mAP when using as few as 300\nproposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The\nrecall of SS and EB drops more quickly than RPN when the proposals are fewer.\n\nOne-Stage Detection vs. Two-Stage Proposal + Detection. The OverFeat paper [18] proposes a\ndetection method that uses regressors and classi\ufb01ers on sliding windows over conv feature maps.\nOverFeat is a one-stage, class-speci\ufb01c detection pipeline, and ours is a two-stage cascade consisting\nof class-agnostic proposals and class-speci\ufb01c detections. In OverFeat, the region-wise features come\nfrom a sliding window of one aspect ratio over a scale pyramid. These features are used to simulta-\nneously determine the location and category of objects. In RPN, the features are from square (3\u00d73)\nsliding windows and predict proposals relative to anchors with different scales and aspect ratios.\nThough both methods use sliding windows, the region proposal task is only the \ufb01rst stage of RPN\n+ Fast R-CNN\u2014the detector attends to the proposals to re\ufb01ne them. In the second stage of our cas-\ncade, the region-wise features are adaptively pooled [7, 5] from proposal boxes that more faithfully\ncover the features of the regions. We believe these features lead to more accurate detections.\nTo compare the one-stage and two-stage systems, we emulate the OverFeat system (and thus also\ncircumvent other differences of implementation details) by one-stage Fast R-CNN. In this system,\nthe \u201cproposals\u201d are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2,\n2:1). 
Fast R-CNN is trained to predict class-specific scores and regress box locations from these\nsliding windows. Because the OverFeat system uses an image pyramid, we also evaluate using conv\nfeatures extracted from 5 scales. We use the same 5 scales as in [7, 5].\nTable 5 compares the two-stage system and two variants of the one-stage system. Using the ZF\nmodel, the one-stage system has an mAP of 53.9% with 5 scales (53.8% with 1 scale). This is lower than the two-stage system (58.7%)\nby 4.8 percentage points. This experiment justifies the effectiveness of cascaded region proposals and object detection.\nSimilar observations are reported in [5, 13], where replacing SS region proposals with sliding\nwindows leads to \u223c6% degradation in both papers. We also note that the one-stage system is slower\nas it has considerably more proposals to process.\n\n5 Conclusion\n\nWe have presented Region Proposal Networks (RPNs) for efficient and accurate region proposal\ngeneration. Because the RPN shares convolutional features with the downstream detection network, the region\nproposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection\nsystem to run at 5-17 fps. 
The learned RPN also improves region proposal quality and thus the\noverall object detection accuracy.\n\n[Figure 2 plot panels: recall (y-axis) vs. IoU (x-axis) for 300, 1000, and 2000 proposals; curves for SS, EB, RPN ZF, and RPN VGG.]\n\nReferences\n[1] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra. Object-Proposal Evaluation Protocol is \u2019Gameable\u2019. arXiv:1505.05836, 2015.\n[2] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.\n[3] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.\n[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.\n[5] R. Girshick. Fast R-CNN. arXiv:1504.08083, 2015.\n[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.\n[7] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 
2014.\n[8] J. Hosang, R. Benenson, P. Doll\u00e1r, and B. Schiele. What makes for effective detection proposals? arXiv:1502.05082, 2015.\n[9] J. Hosang, R. Benenson, and B. Schiele. How good are detection proposals, really? In BMVC, 2014.\n[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.\n[11] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.\n[12] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.\n[13] K. Lenc and A. Vedaldi. R-CNN minus R. arXiv:1506.06981, 2015.\n[14] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.\n[15] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.\n[16] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. arXiv:1504.06066, 2015.\n[17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575, 2014.\n[18] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.\n[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n[20] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov. Scalable, high-quality object detection. arXiv:1412.1441v2, 2015.\n[21] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, 2013.\n[22] J. R. 
Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.\n[23] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.\n[24] C. L. Zitnick and P. Doll\u00e1r. Edge boxes: Locating object proposals from edges. In ECCV, 2014.\n", "award": [], "sourceid": 69, "authors": [{"given_name": "Shaoqing", "family_name": "Ren", "institution": "USTC"}, {"given_name": "Kaiming", "family_name": "He", "institution": "Microsoft Research Asia"}, {"given_name": "Ross", "family_name": "Girshick", "institution": "Microsoft Research"}, {"given_name": "Jian", "family_name": "Sun", "institution": "Microsoft Research Asia"}]}