{"title": "FreeAnchor: Learning to Match Anchors for Visual Object Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 147, "page_last": 155, "abstract": "Modern CNN-based object detectors assign anchors for ground-truth objects under the restriction of object-anchor Intersection-over-Unit (IoU). In this study, we propose a learning-to-match approach to break IoU restriction, allowing objects to match anchors in a flexible manner. Our approach, referred to as FreeAnchor, updates hand-crafted anchor assignment to \"free\" anchor matching by formulating detector training as a maximum likelihood estimation (MLE) procedure. FreeAnchor targets at learning features which best explain a class of objects in terms of both classification and localization. FreeAnchor is implemented by optimizing detection customized likelihood and can be fused with CNN-based detectors in a plug-and-play manner. Experiments on MS-COCO demonstrate that FreeAnchor consistently outperforms the counterparts with significant margins.", "full_text": "FreeAnchor: Learning to Match Anchors for Visual\n\nObject Detection\n\nXiaosong Zhang1,\n\nFang Wan1, Chang Liu1, Rongrong Ji2, Qixiang Ye1,3\u2217\n\n1University of Chinese Academy of Sciences, Beijing, China\n\n2Xiamen University, Xiamen, China\n\n3Peng Cheng Laboratory, Shenzhen, China\n\nzhangxiaosong18@mails.ucas.ac.cn, qxye@ucas.ac.cn\n\nAbstract\n\nModern CNN-based object detectors assign anchors for ground-truth objects under\nthe restriction of object-anchor Intersection-over-Unit (IoU). In this study, we\npropose a learning-to-match approach to break IoU restriction, allowing objects\nto match anchors in a \ufb02exible manner. Our approach, referred to as FreeAnchor,\nupdates hand-crafted anchor assignment to \u201cfree\" anchor matching by formulating\ndetector training as a maximum likelihood estimation (MLE) procedure. 
FreeAnchor targets learning features that best explain a class of objects in terms of both classification and localization. FreeAnchor is implemented by optimizing a detection customized likelihood and can be fused with CNN-based detectors in a plug-and-play manner. Experiments on COCO demonstrate that FreeAnchor consistently outperforms its counterparts by significant margins1.

1 Introduction

Over the past few years we have witnessed the success of convolutional neural networks (CNNs) for visual object detection [1, 2, 3, 4, 5, 6, 7]. To represent objects with various appearances, aspect ratios, and spatial layouts using limited convolutional features, most CNN-based detectors leverage anchor boxes at multiple scales and aspect ratios as reference points for object localization [3, 4, 5, 6, 7]. By assigning each object to one or multiple anchors, features can be determined and the two fundamental procedures, classification and localization (i.e., bounding box regression), can be carried out.

Anchor-based detectors leverage spatial alignment, i.e., Intersection over Union (IoU) between objects and anchors, as the criterion for anchor assignment. Each assigned anchor independently supervises network learning for object prediction, based upon the intuition that the anchors aligned with object bounding boxes are the most appropriate for object classification and localization. However, we argue that this intuition is implausible and that the hand-crafted IoU criterion is not the best choice.

On the one hand, for objects with acentric features, e.g., slender objects, the most representative features are not close to object centers. A spatially aligned anchor might correspond to fewer representative features, which deteriorates classification and localization capabilities. 
On the other hand, it is infeasible to match proper anchors/features for objects using IoU when multiple objects come together.

It is hard to design a generic rule that can optimally match anchors/features with objects of various geometric layouts. The widely used hand-crafted assignment can fail when facing acentric, slender, and/or crowded objects. A learning-based approach needs to be explored to solve this problem in a systematic way, which is the focus of this study.

*Corresponding Author.
1Code is available at https://github.com/zhangxiaosong18/FreeAnchor

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We propose a learning-to-match approach for object detection, which discards hand-crafted anchor assignment while optimizing the learning procedure of visual object detection from three specific aspects. First, to achieve a high recall rate, the detector must guarantee that for each object at least one anchor's prediction is close to the ground-truth. Second, to achieve high detection precision, the detector needs to classify anchors with poor localization (large bounding box regression error) into background. Third, the predictions of anchors should be compatible with the non-maximum suppression (NMS) procedure, i.e., the higher the classification score is, the more accurate the localization should be. Otherwise, an anchor with accurate localization but a low classification score could be suppressed by the NMS process.

To fulfill these objectives, we formulate object-anchor matching as a maximum likelihood estimation (MLE) procedure [8, 9], which selects the most representative anchor from a “bag” of anchors for each object. We define the likelihood probability of each anchor bag as the largest anchor confidence within it. 
Maximizing the likelihood probability guarantees that there exists at least one anchor with high confidence for both object classification and localization, while most anchors, those with large classification or localization error, are classified as background. During training, the likelihood probability is converted into a loss function, which then drives CNN-based detector training and object-anchor matching.

The contributions of this work are summarized as follows:

• We formulate detector training as an MLE procedure and update hand-crafted anchor assignment to free anchor matching. The proposed approach breaks the IoU restriction, allowing objects to flexibly select anchors under the principle of maximum likelihood.

• We define a detection customized likelihood, and implement joint optimization of object classification and localization in an end-to-end mechanism. Maximizing the likelihood drives network learning to match optimal anchors and guarantees compatibility with the NMS procedure.

2 Related Work

Object detection requires generating a set of bounding boxes along with the classification labels of the objects in an image. However, it is not trivial for a CNN-based detector to directly predict an order-less set of arbitrary cardinality. One widely used workaround is to introduce anchors, which employ a divide-and-conquer process to match objects with features. This approach has been successfully demonstrated in Faster R-CNN [3], SSD [5], FPN [6], RetinaNet [7], DSSD [10] and YOLOv2 [11]. In these detectors, dense anchors need to be configured over the convolutional feature maps so that the features extracted from anchors can match object windows and the bounding box regression can be well initialized. 
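The IoU-thresholding assignment these detectors rely on can be sketched as follows. This is a minimal illustration, not the authors' code; the [x1, y1, x2, y2] box format and the `assign_anchors` helper are assumptions for the sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_anchors(objects, anchors, threshold=0.5):
    """Hand-crafted assignment: C[i][j] = 1 iff object i claims anchor j.

    Each anchor is matched by at most one object (the one with the largest
    IoU above the threshold), mirroring the C_ij matrix of Section 3.1.
    """
    C = [[0] * len(anchors) for _ in objects]
    for j, anchor in enumerate(anchors):
        ious = [iou(obj, anchor) for obj in objects]
        best = max(range(len(objects)), key=lambda i: ious[i])
        if ious[best] > threshold:
            C[best][j] = 1
    return C

# Example: one object, two anchors; only the overlapping anchor is assigned.
C = assign_anchors([[0, 0, 10, 10]], [[1, 1, 11, 11], [50, 50, 60, 60]])
```

Note that the rule is purely geometric; it is exactly this fixed, feature-agnostic criterion that FreeAnchor replaces with a learned matching.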
Anchors are then assigned to objects or backgrounds by thresholding their IoUs with ground-truth bounding boxes [3].

Although effective, these approaches are restricted by the heuristic that spatially aligned anchors are suitable for both object classification and localization. For objects with acentric features, however, the detector could miss the best anchors and features.

To break this limitation imposed by pre-assigned anchors, recent anchor-free approaches employ pixel-level supervision [12] and center-ness bounding box regression [13]. CornerNet [14] and CenterNet [15] replace bounding box supervision with key-point supervision. MetaAnchor [16] learns to produce anchors from arbitrary customized prior boxes with a sub-network. GuidedAnchoring [17] leverages semantic features to guide the prediction of anchors while replacing dense anchors with predicted anchors. IoU-Net [18] incorporates IoU-guided NMS, which helps eliminate the suppression failures caused by misleading classification confidences.

Existing approaches have taken a step towards learnable anchor customization. Nevertheless, to the best of our knowledge, a systematic approach that models the correspondence between anchors and objects during detector training is still lacking, which inhibits the optimization of feature selection and feature learning.

Figure 1: Comparison of hand-crafted anchor assignment (top) and FreeAnchor (bottom). FreeAnchor allows each object to flexibly match the best anchor from a “bag” of anchors during detector training.

3 The Proposed Approach

To model the correspondence between objects and anchors, we propose to formulate detector training as an MLE procedure. We then define the detection customized likelihood, which simultaneously facilitates object classification and localization. 
During detector training, we convert the detection customized likelihood into a detection customized loss and jointly optimize object classification, object localization, and object-anchor matching in an end-to-end mechanism.

3.1 Detector Training as Maximum Likelihood Estimation

Let's begin with a CNN-based one-stage detector [7]. Given an input image I, the ground-truth annotations are denoted as B, where a ground-truth box b_i ∈ B is made up of a class label b_i^cls and a location b_i^loc. During the forward propagation of the network, each anchor a_j ∈ A obtains a class prediction a_j^cls ∈ R^k after the Sigmoid activation, and a location prediction a_j^loc = {x, y, w, h} after the bounding box regression, where k denotes the number of object classes.

During training, a hand-crafted criterion based on IoU is used to assign anchors to objects, Fig. 1, and a matrix C_ij ∈ {0, 1} is defined to indicate whether object b_i matches anchor a_j or not. When the IoU of b_i and a_j is greater than a threshold, b_i matches a_j and C_ij = 1; otherwise, C_ij = 0. In particular, when the IoUs of multiple objects with an anchor are all greater than this threshold, the object with the largest IoU matches the anchor, which guarantees that each anchor is matched by at most one object, i.e., Σ_i C_ij ∈ {0, 1}, ∀ a_j ∈ A. By defining A+ ⊆ A as {a_j | Σ_i C_ij = 1} and A− ⊆ A as {a_j | Σ_i C_ij = 0}, the loss function L(θ) of the detector is written as follows:

L(θ) = Σ_{a_j ∈ A+} Σ_{b_i ∈ B} C_ij L_ij^cls(θ) + β Σ_{a_j ∈ A+} Σ_{b_i ∈ B} C_ij L_ij^loc(θ) + Σ_{a_j ∈ A−} L_j^bg(θ),   (1)

where θ denotes the network parameters to be learned. 
L_ij^cls(θ) = BCE(a_j^cls, b_i^cls, θ), L_ij^loc(θ) = SmoothL1(a_j^loc, b_i^loc, θ) and L_j^bg(θ) = BCE(a_j^cls, 0, θ), where 0 denotes an all-zero label vector, respectively denote the Binary Cross Entropy (BCE) loss for classification and the SmoothL1 loss defined for localization [2]. β is a regularization factor and “bg” indicates “background”.

From the MLE perspective, the training loss L(θ) is converted into a likelihood probability, as follows:

P(θ) = e^{−L(θ)}
     = Π_{a_j ∈ A+} (Σ_{b_i ∈ B} C_ij e^{−L_ij^cls(θ)}) Π_{a_j ∈ A+} (Σ_{b_i ∈ B} C_ij e^{−β L_ij^loc(θ)}) Π_{a_j ∈ A−} e^{−L_j^bg(θ)}
     = Π_{a_j ∈ A+} (Σ_{b_i ∈ B} C_ij P_ij^cls(θ)) Π_{a_j ∈ A+} (Σ_{b_i ∈ B} C_ij P_ij^loc(θ)) Π_{a_j ∈ A−} P_j^bg(θ),   (2)

where P_ij^cls(θ) = e^{−L_ij^cls(θ)} and P_j^bg(θ) = e^{−L_j^bg(θ)} denote classification confidences and P_ij^loc(θ) = e^{−β L_ij^loc(θ)} denotes a localization confidence. Minimizing the loss function L(θ) defined in Eq. 1 is equal to maximizing the likelihood probability P(θ) defined in Eq. 2.

Eq. 2 strictly considers the optimization of classification and localization of anchors from the MLE perspective. However, it unfortunately ignores how to learn the matching matrix C_ij. Existing CNN-based detectors [3, 5, 6, 7, 11] solve this problem by empirically assigning anchors using the IoU criterion, Fig. 1, while ignoring the optimization of object-anchor matching.

3.2 Detection Customized Likelihood

To achieve the optimization of object-anchor matching, we extend the CNN-based detection framework by introducing the detection customized likelihood. 
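As a numerical sanity check on this loss-to-likelihood conversion, the sketch below (illustrative only; the toy probabilities are assumptions) verifies that for BCE losses, e^{−L} reduces to a product of per-anchor confidences, exactly as in Eq. 2:

```python
import math

def bce(p, y):
    """Binary cross entropy for one predicted probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Toy setup: two matched anchors (label 1) and one background anchor (label 0).
cls_preds = [0.9, 0.8]   # predicted foreground probabilities of matched anchors
bg_pred = 0.1            # predicted foreground probability of a background anchor

loss = sum(bce(p, 1) for p in cls_preds) + bce(bg_pred, 0)
likelihood = math.exp(-loss)

# e^{-BCE(p, 1)} = p and e^{-BCE(p, 0)} = 1 - p, so the likelihood factorizes
# into a product of classification confidences.
product = cls_preds[0] * cls_preds[1] * (1 - bg_pred)
assert abs(likelihood - product) < 1e-9
```

The same identity holds term by term for the localization factor, with the SmoothL1 loss scaled by β playing the role of the BCE loss.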
Such likelihood intends to incorporate the objectives of recall and precision while guaranteeing compatibility with NMS.

To implement the likelihood, we first construct a bag of candidate anchors for each object b_i by selecting its n top-ranked anchors A_i ⊂ A in terms of their IoU with the object. We then learn to match the best anchor while maximizing the detection customized likelihood.

To optimize the recall rate, for each object b_i ∈ B we require that there exists at least one anchor a_j ∈ A_i whose prediction (a_j^cls and a_j^loc) is close to the ground-truth. The objective function can be derived from the first two terms of Eq. 2, as follows:

P_recall(θ) = Π_i max_{a_j ∈ A_i} (P_ij^cls(θ) P_ij^loc(θ)).   (3)

To achieve increased detection precision, detectors need to classify the anchors of poor localization into the background class. This is fulfilled by optimizing the following objective function:

P_precision(θ) = Π_j (1 − P{a_j ∈ A−}(1 − P_j^bg(θ))),   (4)

where P{a_j ∈ A−} = 1 − max_i P{a_j → b_i} is the probability that a_j misses all objects and P{a_j → b_i} denotes the probability that anchor a_j correctly predicts object b_i.

To be compatible with the NMS procedure, P{a_j → b_i} should have the following three properties: (1) P{a_j → b_i} is a monotonically increasing function of the IoU between a_j^loc and b_i, IoU_ij^loc. (2) When IoU_ij^loc is smaller than a threshold t, P{a_j → b_i} is close to 0. (3) For an object b_i, there exists one and only one a_j satisfying P{a_j → b_i} = 1. These properties can be satisfied with a saturated linear function,

Saturated linear(x, t1, t2) = 0 if x ≤ t1; (x − t1)/(t2 − t1) if t1 < x < t2; 1 if x ≥ t2,

which is shown in Fig. 
2, and we have P{a_j → b_i} = Saturated linear(IoU_ij^loc, t, max_j(IoU_ij^loc)).

Implementing the definitions provided above, the detection customized likelihood is defined as follows:

P′(θ) = P_recall(θ) × P_precision(θ)
      = Π_i max_{a_j ∈ A_i} (P_ij^cls(θ) P_ij^loc(θ)) × Π_j (1 − P{a_j ∈ A−}(1 − P_j^bg(θ))),   (5)

which incorporates the objectives of recall, precision and compatibility with NMS. By optimizing this likelihood, we simultaneously maximize the probability of recall P_recall(θ) and precision P_precision(θ), and thereby achieve free object-anchor matching during detector training.

Figure 2: Saturated linear function.

Figure 3: Mean-max function.

3.3 Anchor Matching Mechanism

To implement this learning-to-match approach in a CNN-based detector, the detection customized likelihood defined by Eq. 5 is converted into a detection customized loss function, as follows:

L′(θ) = − log P′(θ)
      = − Σ_i log( max_{a_j ∈ A_i} (P_ij^cls(θ) P_ij^loc(θ)) ) − Σ_j log( 1 − P{a_j ∈ A−}(1 − P_j^bg(θ)) ),   (6)

where the max function is used to select the best anchor for each object. During training, a single anchor is selected from each bag of anchors A_i, which is then used to update the network parameters θ.

At early training epochs, the confidence of all anchors is small because the network parameters are randomly initialized, so the anchor with the highest confidence is not suitable for detector training. 
We therefore propose using the Mean-max function, defined as:

Mean-max(X) = ( Σ_{x_j ∈ X} x_j/(1 − x_j) ) / ( Σ_{x_j ∈ X} 1/(1 − x_j) ),

which is used to select anchors. When training is insufficient, the Mean-max function, as shown in Fig. 3, is close to the mean function, meaning that almost all anchors in the bag are used for training. As training proceeds, the confidence of some anchors increases and the Mean-max function moves closer to the max function. Once sufficient training has taken place, a single best anchor can be selected from each bag of anchors to match each object.

Replacing the max function in Eq. 6 with Mean-max, adding balance factors w1 and w2, and applying focal loss [7] to the second term of Eq. 6, the detection customized loss function of a FreeAnchor detector is obtained as follows:

L″(θ) = −w1 Σ_i log( Mean-max(X_i) ) + w2 Σ_j FL( P{a_j ∈ A−}(1 − P_j^bg(θ)) ),   (7)

where X_i = {P_ij^cls(θ) P_ij^loc(θ) | a_j ∈ A_i} is a likelihood set corresponding to the anchor bag A_i. Using the parameters α and γ from focal loss [7], we set w1 = α/||B||, w2 = (1 − α)/(n||B||), and FL(x) = −x^γ log(1 − x).

With the detection customized loss defined, we implement the training procedure as Algorithm 1.

Figure 4: Comparison of learning-to-match anchors (left) with hand-crafted anchor assignment (right) for the “laptop” object. Red dots denote anchor centers. Darker (redder) dots denote higher confidence to be matched. For clarity, we select 16 anchors of aspect-ratio 1:1 from all 50 anchors for illustration. 
(Best viewed in color.)

Algorithm 1 Detector training with FreeAnchor.
Input:
  I: Input image.
  B: A set of ground-truth bounding boxes b_i.
  A: A set of anchors a_j in the image.
  n: Hyper-parameter for the anchor bag size.
Output: θ: Detection network parameters.
1: θ ← initialize network parameters.
2: for t = 1 : MaxIter do
3:   Forward propagation: Predict the class a_j^cls and location a_j^loc for each anchor a_j ∈ A.
4:   Anchor bag construction: A_i ← select the n top-ranked anchors a_j in terms of their IoU with b_i.
5:   Loss calculation: Calculate L″(θ) with Eq. 7.
6:   Backward propagation: θ_{t+1} = θ_t − λ ∇_{θ_t} L″(θ_t) using a stochastic gradient descent algorithm.
7: end for
8: return θ

4 Experiments

In this section, we present the implementation of a FreeAnchor detector to appraise the effect of the proposed learning-to-match approach. We also compare the FreeAnchor detector with its counterpart and with state-of-the-art approaches. Experiments were carried out on COCO 2017 [19], which contains ∼118k images for training, 5k for validation (val) and ∼20k for testing without provided annotations (test-dev). Detectors were trained on the COCO training set and evaluated on the val set. Final results were reported on the test-dev set.

4.1 Implementation Details

FreeAnchor is implemented upon a state-of-the-art one-stage detector, RetinaNet [7], using ResNet [20] and ResNeXt [21] as the backbone networks. By simply replacing the loss defined in RetinaNet with the proposed detection customized loss, Eq. 7, we updated the RetinaNet detector to a FreeAnchor detector. For the last convolutional layer of the classification subnet, we set the bias initialization to b = − log((1 − ρ)/ρ) with ρ = 0.02. 
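This prior-probability bias initialization (following the focal-loss setup of RetinaNet) can be checked with a short sketch; the `prior_bias` helper below is hypothetical, not part of the released code:

```python
import math

def prior_bias(rho=0.02):
    """Bias b for the last classification layer so that the initial
    sigmoid output of every anchor equals the prior probability rho."""
    return -math.log((1 - rho) / rho)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

b = prior_bias(0.02)
# With this bias, every anchor initially scores ~0.02 as foreground, which
# stabilizes early training when almost all anchors are background.
assert abs(sigmoid(b) - 0.02) < 1e-9
```

The identity follows directly: sigmoid(−log((1 − ρ)/ρ)) = 1/(1 + (1 − ρ)/ρ) = ρ.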
Training used synchronized SGD over 8 Tesla V100 GPUs with a total of 16 images per mini-batch (2 images per GPU). Unless otherwise specified, all models were trained for 90k iterations with an initial learning rate of 0.01, which is divided by 10 at 60k and again at 80k iterations.

4.2 Model Effect

Learning-to-match: The proposed learning-to-match approach can select proper anchors to represent the object of interest, Fig. 4. As analyzed in the introduction, hand-crafted anchor assignment often fails in two situations: first, for slender objects with acentric features, and second, when multiple objects are gathered in crowded scenes. FreeAnchor effectively alleviates both problems. On slender objects, FreeAnchor significantly outperformed the RetinaNet baseline, Fig. 5. On other, square objects, FreeAnchor reported comparable performance with RetinaNet. The reason for this is that the learning-to-match procedure drives the network to activate at least one anchor within each object's anchor bag in order to predict the correct category and location. This anchor is not necessarily spatially aligned with the object, but has the most representative features for object classification and localization.

Figure 5: Performance comparison on square and slender objects.

Figure 6: Performance comparison on object crowdedness.

We further compared the performance of RetinaNet and FreeAnchor in scenarios of various crowdedness, Fig. 6. As the number of objects in each image increased, FreeAnchor's advantage over RetinaNet became more and more obvious. 
This demonstrated that our approach, with the learning-to-match mechanism, can select more suitable anchors for objects in crowded scenes.

Compatibility with NMS: To assess the compatibility of anchors' predictions with NMS, we defined the NMS Recall (NRτ) as the ratio of the recall rates after and before NMS for a given IoU threshold τ. Following the COCO-style AP metric [19], NR was defined as the average NRτ when τ changes from 0.50 to 0.90 with an interval of 0.05, Table 1. We compared RetinaNet and FreeAnchor in terms of their NRτ. It can be seen that FreeAnchor reported higher NRτ, which means higher compatibility with NMS. This validated that the detection customized likelihood, defined in Section 3.2, can drive joint optimization of classification and localization.

Table 1: Comparison of NMS recall (%) on COCO val set.

backbone   detector            NR    NR50  NR60  NR70  NR80  NR90
ResNet-50  RetinaNet [7]       81.8  98.3  95.7  87.0  71.8  51.3
ResNet-50  FreeAnchor (ours)   83.8  99.2  97.5  89.5  74.3  53.1

4.3 Parameter Setting

Anchor bag size n: We evaluated anchor bag sizes in {40, 50, 60, 100} and observed that a bag size of 50 reported the best performance.

Background IoU threshold t: A threshold was used in P{a_j → b_i} during training. We tried background IoU thresholds in {0.5, 0.6, 0.7} and validated that 0.6 worked best.

Focal loss parameters: FreeAnchor introduces a bag of anchors to replace independent anchors and therefore faces more serious sample imbalance. To handle the imbalance, we experimented with the Focal Loss [7] parameters α in {0.25, 0.5, 0.75} and γ in {1.5, 2.0, 2.5}, and set α = 0.5 and γ = 2.0.

Loss regularization factor β: The regularization factor β in Eq. 
1, which balances the losses of classification and localization, was experimentally validated to be 0.75.

4.4 Detection Performance

In Table 2, FreeAnchor is compared with the RetinaNet baseline. FreeAnchor consistently improved the AP by up to ∼3.0%, which is a significant margin for the challenging object detection task. Note that the performance gain was achieved with a negligible cost in training time.

Table 2: Performance comparison of FreeAnchor and RetinaNet (baseline).

Backbone    Detector            Training time  AP    AP50  AP75  APS   APM   APL
ResNet-50   RetinaNet [7]       5.02h          35.7  55.0  38.5  18.9  38.9  46.3
ResNet-50   FreeAnchor (ours)   5.27h          38.7  57.3  41.6  20.2  41.3  50.1
ResNet-101  RetinaNet [7]       6.96h          37.8  57.5  40.8  20.2  41.1  49.2
ResNet-101  FreeAnchor (ours)   7.26h          40.9  59.9  43.8  21.7  43.8  53.0

FreeAnchor was compared with state-of-the-art one-stage detectors in Table 3, using scale jitter and 2× longer training than the same models in Table 2. It outperformed the baseline RetinaNet [7] and the anchor-free approaches including FoveaBox [22], FSAF [23], FCOS [13] and CornerNet [14]. With a lighter ResNeXt-64x4d-101 backbone network and fewer training iterations, FreeAnchor was comparable with CenterNet in AP (44.9% vs. 44.9%) and reported a higher AP50, which is a more widely used metric in many applications.

“FreeAnchor*” refers to extending the scale range from [640, 800] to [480, 960], achieving 46.0% AP. 
“FreeAnchor**” further utilized multi-scale testing over scales {480, 640, 800, 960, 1120, 1280}, and increased the AP to 47.3%, which outperformed most state-of-the-art detectors with the same backbone network.

Table 3: Performance comparison with state-of-the-art one-stage detectors.

Detector        Backbone       Iter.  AP    AP50  AP75  APS   APM   APL
RetinaNet [7]   ResNet-101     135k   39.1  59.1  42.3  21.8  42.7  50.2
FoveaBox [22]   ResNet-101     135k   40.6  60.1  43.5  23.3  45.2  54.5
FSAF [23]       ResNet-101     135k   40.9  61.5  44.0  24.0  44.2  51.3
FCOS [13]       ResNet-101     180k   41.5  60.7  45.0  24.4  44.8  51.6
RetinaNet [7]   ResNeXt-101    135k   40.8  61.1  44.1  24.1  44.2  51.2
FoveaBox [22]   ResNeXt-101    135k   42.1  61.9  45.2  24.9  46.8  55.6
FSAF [23]       ResNeXt-101    135k   42.9  63.8  46.3  26.6  46.2  52.7
FCOS [13]       ResNeXt-101    180k   43.2  62.8  46.6  26.5  46.2  53.3
CornerNet [14]  Hourglass-104  500k   40.6  56.4  43.2  19.1  42.8  54.3
CenterNet [15]  Hourglass-104  480k   44.9  62.4  48.1  25.6  47.4  57.4
FreeAnchor      ResNet-101     180k   43.1  62.2  46.4  24.5  46.1  54.8
FreeAnchor      ResNeXt-101    180k   44.9  64.3  48.5  26.8  48.3  55.9
FreeAnchor*     ResNeXt-101    180k   46.0  65.6  49.8  27.8  49.5  57.7
FreeAnchor**    ResNeXt-101    180k   47.3  66.3  51.5  30.6  50.4  59.0

5 Conclusion

We proposed an elegant and effective approach, referred to as FreeAnchor, for visual object detection. FreeAnchor updates hand-crafted anchor assignment to “free” object-anchor correspondence by formulating detector training as a maximum likelihood estimation (MLE) procedure. With FreeAnchor implemented, we significantly improved the performance of object detection, in striking contrast with the baseline detector. The underlying reason is that the MLE procedure with the detection customized likelihood facilitates learning convolutional features that best explain a class of objects. 
This provides a fresh insight into the visual object detection problem.

Acknowledgement. This work was supported in part by the NSFC under Grants 61836012, 61671427, and 61771447, and by the Post Doctoral Innovative Talent Support Program under Grant 119103S304.

References

[1] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE CVPR, pages 580–587, 2014.
[2] Ross B. Girshick. Fast R-CNN. In IEEE ICCV, pages 1440–1448, 2015.
[3] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
[4] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE CVPR, pages 779–788, 2016.
[5] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. In ECCV, pages 21–37, 2016.
[6] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In IEEE CVPR, pages 936–944, 2017.
[7] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE ICCV, pages 2999–3007, 2017.
[8] Oded Maron and Tomás Lozano-Pérez. A framework for multiple-instance learning. In NIPS, pages 570–576, 1997.
[9] Ronald A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222(594-604):309–368, 1922.
[10] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD: Deconvolutional single shot detector. 
arXiv:1701.06659, 2017.
[11] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In IEEE CVPR, pages 6517–6525, 2017.
[12] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: an efficient and accurate scene text detector. In IEEE CVPR, pages 2642–2651, 2017.
[13] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. arXiv:1904.01355, 2019.
[14] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, pages 765–781, 2018.
[15] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Object detection with keypoint triplets. In IEEE CVPR, 2019.
[16] Tong Yang, Xiangyu Zhang, Zeming Li, Wenqiang Zhang, and Jian Sun. MetaAnchor: Learning to detect objects with customized anchors. In NIPS, pages 320–330, 2018.
[17] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In IEEE CVPR, pages 2965–2974, 2019.
[18] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, pages 784–799, 2018.
[19] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE CVPR, pages 770–778, 2016.
[21] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE CVPR, pages 1492–1500, 2017.
[22] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, and Jianbo Shi. FoveaBox: Beyond anchor-based object detector. 
arXiv:1904.03797, 2019.
[23] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In IEEE CVPR, pages 840–849, 2019.
", "award": [], "sourceid": 81, "authors": [{"given_name": "Xiaosong", "family_name": "Zhang", "institution": "University of Chinese Academy of Sciences"}, {"given_name": "Fang", "family_name": "Wan", "institution": "University of Chinese Academy of Sciences"}, {"given_name": "Chang", "family_name": "Liu", "institution": "University of Chinese Academy of Sciences"}, {"given_name": "Rongrong", "family_name": "Ji", "institution": "Xiamen University, China"}, {"given_name": "Qixiang", "family_name": "Ye", "institution": "University of Chinese Academy of Sciences, China"}]}