{"title": "Deep Attentive Tracking via Reciprocative Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1931, "page_last": 1941, "abstract": "Visual attention, derived from cognitive neuroscience, facilitates human perception on the most pertinent subset of the sensory data. Recently, significant efforts have been made to exploit attention schemes to advance computer vision systems. For visual tracking, it is often challenging to track target objects undergoing large appearance changes. Attention maps facilitate visual tracking by selectively paying attention to temporal robust features. Existing tracking-by-detection approaches mainly use additional attention modules to generate feature weights as the classifiers are not equipped with such mechanisms. In this paper, we propose a reciprocative learning algorithm to exploit visual attention for training deep classifiers. The proposed algorithm consists of feed-forward and backward operations to generate attention maps, which serve as regularization terms coupled with the original classification loss function for training. The deep classifier learns to attend to the regions of target objects robust to appearance changes. 
Extensive experiments on large-scale benchmark datasets show that the proposed attentive tracking method performs favorably against the state-of-the-art approaches.", "full_text": "Deep Attentive Tracking via Reciprocative Learning\n\nShi Pu1 Yibing Song2 Chao Ma3 Honggang Zhang1\u2217 Ming-Hsuan Yang4\n\n1Beijing University of Posts and Telecommunications, Beijing, China\n\n{pushi_519200, zhhg}@bupt.edu.cn\n\n2Tencent AI Lab, Shenzhen, China\ndynamicstevenson@gmail.com\n\n3Shanghai Jiao Tong University, Shanghai, China\n\n4University of California at Merced, Merced, U.S.A\n\nchaoma@sjtu.edu.cn\n\nmhyang@ucmerced.edu\n\nhttps://ybsong00.github.io/nips18_tracking/index\n\nAbstract\n\nVisual attention, derived from cognitive neuroscience, facilitates human perception\non the most pertinent subset of the sensory data. Recently, signi\ufb01cant efforts have\nbeen made to exploit attention schemes to advance computer vision systems. For\nvisual tracking, it is often challenging to track target objects undergoing large\nappearance changes. Attention maps facilitate visual tracking by selectively paying\nattention to temporal robust features. Existing tracking-by-detection approaches\nmainly use additional attention modules to generate feature weights as the classi\ufb01ers\nare not equipped with such mechanisms. In this paper, we propose a reciprocative\nlearning algorithm to exploit visual attention for training deep classi\ufb01ers. The\nproposed algorithm consists of feed-forward and backward operations to generate\nattention maps, which serve as regularization terms coupled with the original\nclassi\ufb01cation loss function for training. The deep classi\ufb01er learns to attend to the\nregions of target objects robust to appearance changes. 
Extensive experiments on\nlarge-scale benchmark datasets show that the proposed attentive tracking method\nperforms favorably against the state-of-the-art approaches.\n\n1\n\nIntroduction\n\nThe recent years have witnessed growing interest in developing visual tracking methods for various\nvision applications. Visual attention plays an important role in facilitating tracking target objects\nin videos. For example, the state-of-the-art trackers based on discriminative correlation \ufb01lters\n(DCFs) [21, 11] regress input features into a Gaussian response map for target localization. They\noften apply empirical spatial weights to input features to suppress the boundary effect caused by\nthe Fourier transform. The spatial weights are generated by either a cosine [34] or a Gaussian [11]\nfunction. From the perspective of visual attention, we interpret these spatial weights as a speci\ufb01c\ntype of attention maps. To improve the localization accuracy, the weights closer to the center regions\nof the input features are set to be larger. However, using these empirical attention maps limits the\ntracking performance when target objects undergo large movements. Meanwhile, deemphasizing\nnon-central regions tends to downgrade the target response, leading to inaccurate localizations.\nOn the other hand, two-stage tracking-by-detection approaches \ufb01rst draw a set of proposals and\nthen classify each proposal as either the target or the background. Visual attention has a great\npotential of facilitating learning discriminative classi\ufb01ers. Existing deep attentive trackers [8, 27]\n\n\u2217Honggang Zhang is the corresponding author.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fmainly use additional attention modules to generate feature weights. In other words, attention\nschemes are implemented by performing feature selection. 
Benefiting from end-to-end training, such attention schemes improve the tracking accuracy by strengthening the discriminative power of features. However, the feature weights learned in single frames are unlikely to enable classifiers to concentrate on robust features over a long temporal span. Moreover, slight inaccuracies in the feature weights will exacerbate the misclassification problem. This calls for an in-depth investigation into how to best exploit the visual attention of deep classifiers so that they can attend to target objects over time.\n\nIn this paper, we propose a reciprocative learning algorithm to exploit visual attention, which advances the tracking-by-detection framework. Different from existing trackers that use additional attention modules to weigh features, we directly train an attentive classifier. The training process consists of both a forward and a backward step. In the forward step, we feed an input sample into a deep tracking-by-detection network and compute the corresponding classification score. In the backward step, we take the partial derivative of this classification score with respect to the input sample, propagating from the last fully-connected layer towards the first convolutional layer. Note that, in the backward step, we do not update any network parameters. Instead, we take the partial derivative output at the first layer as the attention map. Each pixel value of this attention map indicates how much the corresponding pixel of the input sample affects the classification accuracy. We exploit this map as a regularization term and add it to the loss function during training. The network parameters are updated following the conventional backward propagation scheme.
As a result, the deep classifier learns to attend to the target regions and effectively eliminates background interference. In the test stage, the learned classifiers directly predict the classification score of each input sample for target localization. We validate the effectiveness of the proposed method on large-scale benchmark datasets. Our method shows favorable performance against the state-of-the-art approaches.\n\nWe summarize the main contributions of our work as follows:\n\n\u2022 We propose a reciprocative learning algorithm to exploit visual attention within the tracking-by-detection framework.\n\n\u2022 We use the attention maps as regularization terms coupled with the classification loss to train deep classifiers, which in themselves learn to attend to temporal robust features.\n\n\u2022 We conduct extensive experiments on benchmark datasets where the proposed tracker performs favorably against state-of-the-art approaches.\n\n2 Related Work\n\nVisual tracking has been widely surveyed in the literature [42]. In this section, we mainly discuss the representative trackers and the related topic of visual attention.\n\nVisual tracking. The state-of-the-art visual tracking methods typically use a one-stage regression framework or a two-stage classification framework. The representative one-stage regression framework is based on discriminative correlation filters (DCFs), which regress all the circular-shifted versions of the input features into soft labels generated by a Gaussian function. By computing the spatial correlation in the Fourier domain as an element-wise product, the DCF trackers have received considerable attention recently due to their fast tracking speed. Starting from the MOSSE tracker [4], extensions include kernelized correlation filters [20, 21], spatial regularization and multi-scale fusion [11, 13], CNN feature integrations [34, 38, 57, 26], and end-to-end predictions [48, 3, 46, 43].
Differ-\nent from the one-stage regression framework, the two-stage tracking-by-detection framework mainly\nconsists of two steps. The \ufb01rst step draws a sparse set of samples around the previously predicted\nlocation, and the second step classi\ufb01es each sample as the target or the background. Considerable\nefforts have been made to improve the tracking-by-detection framework, including online boosting\n[17, 1], P-N learning [25], structured SVM [18, 36], CNN-SVM [22], random forests [56], multiple\ndomain learning [35], adversarial learning [44], active tracking [32] and multiple object tracking [33].\nIn this work, we extend the two-stage tracking-by-detection framework by exploiting visual attention.\nOur deep classi\ufb01er learns to attend to every discriminative region to separate target objects from the\nbackground.\nVisual attention. The visual attention starts from cognitive neuroscience, where the human per-\nception focuses on the most pertinent subset of the sensory data. The visual attention scheme\n\n2\n\n\fFigure 1: Overview of the proposed reciprocative learning algorithm. Given a training sample, we\n\ufb01rst compute its classi\ufb01cation score in a forward operation. Then we obtain attention maps in a\nbackward operation by taking the partial derivative of the classi\ufb01cation score with respect to this\nsample. We use these maps as a regularization term coupled with the classi\ufb01cation loss to train the\nclassi\ufb01er. In the test stage, no attention maps are generated. The classi\ufb01er directly predicts the target\nlocation.\n\nhas been widely exploited for many computer vision applications, including image classi\ufb01cation\n[40, 47, 23, 24], image caption [52], pose estimation [14], etc. In visual tracking, the spatial weights,\nwhich are widely used by DCF trackers to suppress the boundary effect, can be interpreted as one\ntype of visual attention. 
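For illustration, such empirical spatial attention maps can be generated in a few lines. The snippet below is our own sketch (NumPy), not part of any cited tracker's code; `n` is the search-patch size and `sigma` a hypothetical Gaussian bandwidth:

```python
import numpy as np

def cosine_window(n):
    # 2-D cosine (Hann) window: weights decay from the patch centre to zero
    # at the borders, suppressing the boundary effect of the Fourier transform.
    w = np.hanning(n)
    return np.outer(w, w)

def gaussian_window(n, sigma):
    # 2-D Gaussian window centred on the search patch.
    c = (n - 1) / 2.0
    y, x = np.mgrid[0:n, 0:n]
    return np.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2))
```

Both maps peak at the patch centre and vanish (or nearly vanish) at the corners, which is exactly the center-favoring prior criticized above.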
Examples of such spatial weights include the cosine window map [4] and\nthe Gaussian window map [11, 13]. Recently, a number of efforts [7, 8, 6, 27] have been made to\nexploit visual attention within deep models. These approaches emphasize attentive features and resort\nto additional attention modules to generate feature weights. The classi\ufb01ers in these approaches are\nassumed not to be attentive. In this work, we propose a reciprocative learning process to learn an\nattentive classi\ufb01er in the tracking-by-detection framework. We exploit attention maps as regulariza-\ntion terms coupled with the original classi\ufb01cation loss for training classi\ufb01ers. Our deep classi\ufb01er\nwith reciprocative learning in itself can attend to temporal robust features to improve the tracking\naccuracy.\n\n3 Proposed Method\n\nWe propose a reciprocative learning scheme to activate the attentive ability of the classi\ufb01er in the\ntracking-by-detection framework. Figure 1 shows an overview. Different from the existing attention\nmodels [47, 23, 7, 8, 6, 27] proposing additional modules to produce attention maps, we use the\npartial derivatives of the network outputs with respect to input images as attention maps. While some\nvisualization methods [40, 39] use the attention maps to understand deep models, our method uses\nthe attention maps to serve as regularization terms in the training stage to help the classi\ufb01er attend to\ntarget regions robust to appearance changes. During testing, we directly use the classi\ufb01cation scores\nto locate target objects. In the following, we \ufb01rst introduce how to incorporate attention maps in the\noriginal classi\ufb01cation loss function. Then, we illustrate how the attention map gradually regularizes\nthe classi\ufb01er through reciprocative learning.\n\n3.1 Attention Exploitation\n\nWe \ufb01rst present how we exploit the visual attention within the tracking-by-detection network. 
We denote by I the input to the CNN tracking-by-detection network. The network outputs a vector of scores. Each element score indicates how likely I belongs to a predefined class c. Given a specific input sample I0, we use the first-order Taylor expansion [40] at a point z0 to approximate the score function fc(I) as follows:\n\nfc(I) \u2248 Ac\u22a4I + B. (1)\n\nThe point z0 belongs to the deleted \u03b5-neighborhood of I0 (\u03b5 \u2192 0). The approximation (Eq. 1) holds true for any point in the \u03b5-neighborhood of I0. Therefore, the derivatives of fc(I) at the points z0 and I0 are equal (f\u2032c(z0) = f\u2032c(I0)) as these two points are infinitely close. In Eq. 1, Ac is the derivative of fc(I) with respect to the input I at the sample I0:\n\nAc = \u2202fc(I)/\u2202I |I=I0. (2)\n\nEq. 1 indicates that the output score of class c is affected by the element values of Ac. In other words, the values of Ac indicate the importance of the corresponding pixels of I0 in generating the class score. As such, we can interpret Ac as an attention map. For another specific input sample I1, we again use the Taylor expansion at a point z1 to approximate fc(I). The point z1 belongs to the deleted \u03b5-neighborhood of I1. The new approximation holds true for any point in the \u03b5-neighborhood of I1. Thus, the attention map Ac is specific to each input image sample.\n\nAccording to Eq. 2, we compute the partial derivative of the network output fc(I) with respect to the input I at one specific sample I0. This is achieved in two steps. First, we feed the input sample I0 into the network and obtain the predicted score fc(I0) in a forward propagation.
Then, we take the partial derivative of fc(I) with respect to I at I = I0. According to the chain rule, this partial derivative is computed through backward propagation. We take the output of the first layer during backward propagation as the attention map Ac. We only keep the gradients with positive values, as they make clear positive contributions to the class score. Thus, the attention map Ac is always positive-valued and reflects how the network attends to the input sample I0. Note that in this backward propagation, the network parameters are fixed and not updated.\n\n3.2 Attention Regularization\n\nThe tracking-by-detection framework usually defines the target object as the positive class and the background as the negative class to train a binary classifier. For each input sample I0, we obtain two attention maps. One is the positive attention map (denoted by Ap) and the other is the negative attention map (denoted by An). For a positive training sample (labeled as y = 1), we expect the pixel values of Ap related to target objects to be large. In comparison, the pixel values of An related to target objects should be small. The attention regularization term for one positive sample can be formulated as:\n\nR(y=1) = \u03c3(Ap)/\u00b5(Ap) + \u00b5(An)/\u03c3(An), (3)\n\nwhere \u00b5 and \u03c3 are the mean and standard deviation operators for the attention maps. On the other hand, for a negative training sample (labeled as y = 0), we formulate its corresponding regularization term as:\n\nR(y=0) = \u00b5(Ap)/\u03c3(Ap) + \u03c3(An)/\u00b5(An). (4)\n\nUsing Eq. 3 and Eq. 4, we add the attention regularization terms into the original classification loss as:\n\nL = LCE + \u03bb \u00b7 [y \u00b7 R(y=1) + (1 \u2212 y) \u00b7 R(y=0)], (5)\n\nwhere R(y=1) and R(y=0) denote the regularization terms of the positive and negative training examples, respectively.
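Concretely, Eqs. 3-5 reduce to simple mean/standard-deviation statistics of the two attention maps. The sketch below is our own illustration (NumPy), not the authors' released code; the attention maps `a_p` and `a_n` are assumed to be given by the backward pass described above:

```python
import numpy as np

def reg_term(a_p, a_n, y):
    # Eqs. 3-4: ratio statistics of the positive (a_p) and negative (a_n)
    # attention maps, where mean and std play the roles of mu and sigma.
    mu_p, sd_p = a_p.mean(), a_p.std()
    mu_n, sd_n = a_n.mean(), a_n.std()
    if y == 1:
        # Positive sample: reward a large, uniform a_p and a small, diffuse a_n.
        return sd_p / mu_p + mu_n / sd_n
    # Negative sample: the symmetric counterpart (Eq. 4).
    return mu_p / sd_p + sd_n / mu_n

def total_loss(l_ce, a_p, a_n, y, lam=5.0):
    # Eq. 5: cross-entropy plus the lambda-weighted attention regularizer.
    return l_ce + lam * (y * reg_term(a_p, a_n, 1) + (1 - y) * reg_term(a_p, a_n, 0))
```

Here `lam` plays the role of lambda in Eq. 5; the paper's sensitivity analysis selects lambda = 5.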
The scalar parameter \u03bb balances the attention regularization terms and the cross-entropy loss LCE.\n\nEq. 5 shows how attention maps contribute to training the deep classifier. In addition to the classification loss, we incorporate the constraints from attention maps. For positive samples, we aim to increase the attention around the target object in two aspects. The first is to increase the mean but decrease the standard deviation of Ap, so that its pixel intensity values are large with small variance. The second is to decrease the mean but increase the standard deviation of An, so that its pixel intensity values are small with large variance. These two aspects reflect that the classifier learns to increase the true positive rates while decreasing the false negative rates. A similar intuition holds for Eq. 4, where we decrease the false positive rates and increase the true negative rates of the classifier. As a result, the regularization terms help increase the classification accuracy through the constraint from attention maps. This contributes to the classifier training process, as the attention maps heavily influence the output class scores, as shown in Eq. 1.\n\n3.3 Reciprocative Learning\n\nBy incorporating the regularization terms in the loss function, the reciprocative learning algorithm is easy to implement. We resort to the standard backward propagation and chain rule. In each\n\nFigure 2: Visualization of attention maps on the Car4 sequence [50] at iterations #50, #100, #150 and #200. (a) Training input. (b) Without reciprocative learning. (c) With reciprocative learning. With the proposed reciprocative learning algorithm, attention maps gradually cover the whole area of the target.
The classifier thus attends to every region that can differentiate the target from the background.\n\niteration of the classifier training, we compute the attention maps of every input training example. These attention maps reflect the attention of the classifier at its current status. Ideally, the classifier will selectively pay more attention to target objects than to the background. As shown in Figure 2(b), the classifier without attention regularization terms tends to focus on a limited number of discriminative regions. When target objects undergo large appearance changes, those limited regions are unlikely to represent target objects robustly throughout the whole video. With the attention regularization, the classifier iteratively learns to attend to every region that can differentiate the target from the background. The classifier gradually focuses its attention on the whole target area. Figure 2(c) shows that the attention map only covers a subpart of the target region at the beginning of training (i.e., iteration #50). By means of reciprocative learning, the attention map gradually grows to cover the whole target region. In the test stage, the attention regularization terms are not used. The classifier itself is able to attend to input samples.\n\n4 Tracking Process\n\nIn this section, we discuss how to carry out the tracking task on a video. The proposed tracking algorithm does not require offline training. We mainly discuss three components: model initialization, online detection and model update.\n\nModel initialization. In the first frame, we randomly draw N1 samples around the initial target location. These samples are labeled as either positive or negative according to whether their intersection over union (IoU) scores with the ground truth annotations are greater than 0.5. We use H1 iterations in the initialization step.
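The IoU-based labeling rule can be sketched as follows (an illustrative helper of our own, not the authors' code; boxes are (x, y, w, h) tuples):

```python
def iou(a, b):
    # Intersection over union of two (x, y, w, h) boxes.
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def label_sample(sample, gt, thresh=0.5):
    # A drawn sample is positive (1) if it overlaps the ground truth
    # annotation by more than the IoU threshold, negative (0) otherwise.
    return 1 if iou(sample, gt) > thresh else 0
```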
For each sample in every iteration, we compute its loss using Eq. 5 and update the fully-connected layers accordingly.\n\nOnline detection. Given one frame in the video sequence, we first draw N2 samples around the predicted location of the target in the previous frame. Then we feed each sample into our network. We select the candidate with the highest classification score and refine the target location using bounding box regression as in [16].\n\nModel update. In each frame, we draw N2 samples around the predicted target location. These samples are labeled as either positive or negative according to their IoU scores with the predicted bounding box. We then use these samples to update the fully-connected layers for H2 iterations every T frames.\n\nWe analyze how reciprocative learning contributes to the tracking-by-detection framework by visualizing the attention map, the confidence map and the tracking results. The attention map shows how the network attends to the input image. The scores in the confidence map indicate the probability of being the target object. We plot the predicted bounding boxes in red and the ground truth annotations in green. Figure 3 compares the visualization results with and without the reciprocative learning scheme. The state-of-the-art tracking-by-detection framework [35] is used as a baseline. We notice\n\nFigure 3: Visualization of the prediction process with and without reciprocative learning on the Girl sequence [50] (frames #2, #432 and #467; (a) without and (b) with reciprocative learning). From the first to the third row are: attention maps, confidence scores, and tracking results (in red). Ground truth annotations are in green.
With the proposed reciprocative learning algorithm, the classifier pays more attention to temporal robust features and differentiates the target well from obstructions with similar appearance in Frame #432.\n\nthat at the beginning of the video sequence (Frame #2), the attention map, confidence map and tracking results are almost the same for the baselines with and without reciprocative learning. This means that the attention map has not yet regularized the classifier well, as it has not fully identified the target regions. However, along the tracking process, the reciprocative learning scheme helps the attention map cover the whole target region and thus strengthens the instance awareness. As a result, the network is able to produce a high confidence map on the target region. This helps the classifier differentiate the target from the background during occlusion, even when the obstructions are highly similar (i.e., both are human faces), as in Frame #432 and Frame #467. In comparison, the baseline without the reciprocative learning scheme drifts in the presence of occlusion, as its classifier does not attend to temporal robust features.\n\n5 Experiments\n\nIn this section, we first present the implementation details. Then we conduct ablation studies from two perspectives: 1) we investigate how the regularization terms contribute to learning discriminative classifiers; 2) to demonstrate the effectiveness of reciprocative learning, we compare with an alternative implementation using additional attention modules to generate feature weights. Finally, we evaluate our method on the standard benchmarks, i.e., OTB-2013 [50], OTB-2015 [51] and VOT-2016 [28]. We present more experimental results in the supplementary materials, and will make the source code available to the public.\n\nImplementation details. We use the same network architecture as in [35] to develop our baseline tracker.
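Section 4's three components (model initialization, online detection, model update) form a simple online loop. The skeleton below is our own sketch rather than the released code: candidate sampling is reduced to a deterministic grid of shifts, and the classifier is abstracted as a score function.

```python
def candidates_around(box, step=2):
    # Stand-in for drawing N2 samples around the previous box:
    # a dense grid of (dx, dy) shifts with the scale kept fixed.
    x, y, w, h = box
    return [(x + dx, y + dy, w, h)
            for dx in range(-step, step + 1)
            for dy in range(-step, step + 1)]

def track_sequence(frames, init_box, score_fn, update_fn=None, update_every=10):
    # Online detection: keep the candidate with the highest classification
    # score; periodically fine-tune the classifier (every T = 10 frames in
    # the paper's model update step).
    box, history = init_box, []
    for t, frame in enumerate(frames):
        box = max(candidates_around(box), key=lambda b: score_fn(frame, b))
        history.append(box)
        if update_fn is not None and (t + 1) % update_every == 0:
            update_fn(history)  # e.g. H2 SGD iterations on freshly drawn samples
    return history
```

With a toy score function that peaks at a fixed location, the loop greedily shifts the box toward the score maximum frame by frame.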
There are three fixed convolutional layers as the feature extractor and three learnable fully-connected layers as the classifier. The convolutional layers share the same weights as the VGG-M model [41]. The fully-connected layers are randomly initialized and incrementally updated during the tracking process. Our goal is to update the parameters of the fully-connected layers to learn an attentive classifier. In the first frame, the number N1 of samples is set to 5500. We train the randomly initialized classifier for H1 = 50 iterations with a learning rate of 2e-4. In each iteration, we feed one mini-batch containing 32 positive and 32 negative samples into the network. In the online model update step, we fine-tune the classifier for H2 = 15 iterations every T = 10 frames with a learning rate of 3e-4. The network solver is stochastic gradient descent (SGD). During online detection, the number N2 of proposals is set to 256. Our implementation is based on PyTorch [37] and runs on a PC with an i7 3.4 GHz CPU and a GeForce GTX 1080 GPU. The average tracking speed is 1 FPS.\n\nTable 1: Parameter sensitivity analysis of \u03bb on the OTB-2013 dataset. Red: best. Blue: second best.\n\n\u03bb     0      1      2      3      4      5      6      7      8\nDP   0.911  0.917  0.936  0.941  0.940  0.944  0.938  0.929  0.911\nOS   0.671  0.688  0.694  0.704  0.704  0.704  0.701  0.695  0.684\n\nTable 2: Evaluation of baseline improvements using attentive features and attentive classifiers on the OTB-2013 dataset. Red: best. Blue: second best.\n\n      Baseline   Baseline + Attentive Features   Baseline + Attentive Classifier (Ours)\nDP    0.911      0.932                           0.944\nOS    0.671      0.686                           0.704\n\nEvaluation metrics. We follow the standard benchmark protocols on the OTB-2013 and OTB-2015 datasets.
We report the distance precision (DP) and overlap success (OS) rates under the one-pass evaluation (OPE). The distance precision rate measures the center pixel distance between the predicted locations and the ground truth annotations; we report the DP rate at a threshold of 20 pixels. The overlap success rate measures the overlap between the predicted bounding boxes and the ground truth annotations; we report the area-under-the-curve scores of the OS rate plots. In addition, we compute the average center distance error in pixels, i.e., the center location error (CLE), and the overlap success rate at a threshold of 0.5 IoU (OS0.5). On the VOT-2016 dataset, we use three evaluation metrics: expected average overlap (EAO), accuracy rank (Ar) and robustness rank (Rr).\n\n5.1 Ablation Studies\n\nWe propose the reciprocative learning algorithm to extend the tracking-by-detection framework. In this section, we investigate how reciprocative learning contributes to learning discriminative classifiers. Note that we exploit attention maps as a regularization term in the loss function. The scalar parameter \u03bb in Eq. 5 balances the classification loss and the attention regularization term. We set \u03bb between 0 and 8 at an interval of 1 to evaluate the tracking performance on the OTB-2013 dataset. Table 1 shows that reciprocative learning consistently improves the baseline tracker (i.e., \u03bb = 0) even when \u03bb varies over a wide range. This affirms the effectiveness of the proposed reciprocative learning algorithm. Our method generally achieves top-performing results when \u03bb is between 3 and 5. In the following experiments, we fix \u03bb = 5 to report our tracking results.\n\nNote that using attention maps as feature weights can also improve the tracking-by-detection framework, as in [7, 8, 6, 27].
For fair comparison, we implement this strategy on top of our baseline by adding an attention module to generate feature weights. The classifier is learned on the attentive features with the original classification loss. Table 2 shows that using attentive features indeed improves the baseline tracking performance. However, the baseline tracker with reciprocative learning achieves much larger performance gains than that with the attentive feature scheme: 3.2% vs. 2.1% in DP, and 3.3% vs. 1.5% in OS. This is because our deep classifier itself learns to attend to the target object through attention regularization, while the attentive features learned in single frames are unlikely to be invariant to large appearance changes over a long temporal span.\n\n5.2 Overall Performance\n\nOTB-2013 dataset. We compare our method with 57 trackers. Among them, 29 trackers are from the OTB benchmarks [50, 51]. The remaining 28 state-of-the-art trackers include ACFN [6], BACF [26], ADNet [53], CCOT [13], CREST [43], MCPF [57], CSR-DCF [31], SRDCFD [12], SINT [45], MDNet [35], HDT [38], Staple [2], GOTURN [19], KCF [21], TGPR [15], CNT [55], DSST [9], MEEM [54], RPT [30], SAMF [29], DLSSVM [36], BIT [5], SO-DLT [49], SCT [7], CNN-SVM [22], FCNT [48], HCF [34] and HART [27]. For presentation clarity, we only display the top 10 trackers. Figure 4 shows that our method generally performs well against state-of-the-art approaches on the OTB-2013 dataset under the distance precision and overlap success metrics. Specifically, our method improves the baseline MDNet by a large margin, which is not offline trained on auxiliary\n\nFigure 4: Distance precision and overlap success plots using the one-pass evaluation on the OTB-2013 and OTB-2015 datasets.\n\nTable 3: Comparisons with the state-of-the-art trackers on the OTB-2013 and OTB-2015 datasets.
Our tracker performs favorably against existing trackers in center location error (CLE) and in the overlap success rate at a threshold of 0.5 IoU (OS0.5). Red: best. Blue: second best.\n\nTrackers           Ours    CCOT    CREST   MDNet   MCPF    ADNet   BACF    SINT    SRDCFD  ACFN\nCLE (OTB-2013)     7.85    15.58   21.19   10.22   11.19   13.79   26.20   11.85   29.51   18.69\nCLE (OTB-2015)     12.42   13.99   16.87   13.05   20.86   14.65   28.10   25.75   31.52   25.14\nOS0.5 (OTB-2013)   0.913   0.837   0.860   0.849   0.858   0.836   0.840   0.816   0.814   0.750\nOS0.5 (OTB-2015)   0.848   0.823   0.806   0.776   0.780   0.802   0.776   0.719   0.766   0.686\n\nsequences for fair comparison. Table 3 shows the results in center location error and OS0.5. The favorable results against state-of-the-art approaches demonstrate the effectiveness of our method in significantly reducing the average center distance error and increasing the tracking success rates.\n\nOTB-2015 dataset. We compare our method with the aforementioned trackers on the OTB-2015 dataset. Figure 4 shows that our tracker overall performs well against the state-of-the-art. While the top-performing tracker CCOT achieves higher distance precision results, Table 3 shows that our tracker has a smaller center location error than CCOT on all the benchmark sequences. The overall favorable performance of our tracker can be explained by the fact that the proposed reciprocative learning algorithm strengthens the discriminative power of the classifier, whereas the compared methods do not explicitly exploit visual attention. Our tracker does not perform as well as the top-performing tracker CCOT in overlap success rate. This is because our tracker randomly draws a sparse set of samples for scale estimation, while CCOT crops samples in a continuous space. We will explore this idea in future work.\n\nVOT-2016 dataset.
We evaluate our method on the VOT-2016 dataset in comparison to state-of-the-art trackers including Staple [2], MDNet [35], CCOT [13], EBT [58], DeepSRDCF [10], and SiamFC [3]. Table 4 shows that CCOT performs best under the EAO metric. Our tracker overall performs comparably with CCOT, and much better than the others. The VOT-2016 report suggests that trackers whose EAO value exceeds 0.251 belong to the state-of-the-art. All the compared trackers, including ours, are thus state-of-the-art.\n\nTable 4: Comparisons with the state-of-the-art trackers on the VOT-2016 dataset. The results are presented in terms of expected average overlap (EAO), accuracy rank (Ar) and robustness rank (Rr). Red: best. Blue: second best.\n\nTrackers   Ours    CCOT    Staple   MDNet   EBT     DSRDCF   SiamFC\nEAO        0.320   0.331   0.295    0.257   0.291   0.276    0.277\nAr         1.47    1.98    1.87     1.72    3.62    2.12     1.30\nRr         2.10    1.95    3.23     2.80    2.13    2.82     3.17\n\n[Figure 4 legends: OTB-2013 DP: Ours 0.944, MCPF 0.916, MDNet 0.911, CREST 0.908, CCOT 0.908, ADNet 0.903, HDT 0.889, SINT 0.882, SRDCFD 0.870, BACF 0.861; OTB-2013 OS: Ours 0.704, CCOT 0.677, MCPF 0.677, CREST 0.673, MDNet 0.671, ADNet 0.659, BACF 0.657, SINT 0.655, SRDCFD 0.653, DLSSVM 0.608; OTB-2015 DP: CCOT 0.903, Ours 0.895, ADNet 0.880, MDNet 0.878, MCPF 0.873, HDT 0.848, CREST 0.838, SRDCFD 0.825, BACF 0.824, CNN-SVM 0.814; OTB-2015 OS: CCOT 0.673, Ours 0.668, ADNet 0.646, MDNet 0.646, MCPF 0.628, SRDCFD 0.627, CREST 0.623, BACF 0.621, SINT 0.592, ACFN 0.569.]\n\n6 Concluding Remarks\n\nIn this paper, we propose a reciprocative learning scheme to exploit visual attention within
the tracking-by-detection framework. For each input sample, we first compute the classification loss in a forward propagation, and take the partial derivatives with respect to this sample in a backward propagation as attention maps. We then use the attention maps as a regularization term coupled with the original classification loss function for training discriminative classifiers. Compared with existing attention models, which introduce additional modules to generate feature weights, the proposed reciprocative learning algorithm uses attention maps to regularize classifier learning. Our classifier learns to attend to features that are robust over a long temporal span. In the test stage, no attention maps are generated; the classifier directly classifies each input sample. Extensive evaluations on the benchmark datasets demonstrate that our method performs favorably against state-of-the-art approaches.

Acknowledgements

The work is supported in part by the Beijing Municipal Science and Technology Commission project under Grant No. Z181100001918005, the Fundamental Research Funds for the Central Universities (2017RC08), NSF CAREER Grant No. 1149783, and gifts from NVIDIA. Shi Pu is supported by a scholarship from the China Scholarship Council.

References
[1] B. Babenko, M.-H. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE PAMI, 2011.
[2] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr. Staple: Complementary learners for real-time tracking. In CVPR, 2016.
[3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In ECCVW, 2016.
[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In CVPR, 2010.
[5] B. Cai, X. Xu, X. Xing, K. Jia, J. Miao, and D. Tao. Bit: Biologically inspired tracker.
IEEE TIP, 2016.
[6] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, J. Y. Choi, et al. Attentional correlation filter network for adaptive visual tracking. In CVPR, 2017.
[7] J. Choi, H. Jin Chang, J. Jeong, Y. Demiris, and J. Young Choi. Visual tracking using attention-modulated disintegration and integration. In CVPR, 2016.
[8] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu. Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In CVPR, 2017.
[9] M. Danelljan, G. Häger, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In BMVC, 2014.
[10] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In ICCVW, 2015.
[11] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, 2015.
[12] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking. In CVPR, 2016.
[13] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
[14] W. Du, Y. Wang, and Y. Qiao. Rpan: An end-to-end recurrent pose-attention network for action recognition in videos. In CVPR, 2017.
[15] J. Gao, H. Ling, W. Hu, and J. Xing. Transfer learning based visual tracking with gaussian processes regression. In ECCV, 2014.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[17] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In BMVC, 2006.
[18] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr. Struck: Structured output tracking with kernels. IEEE PAMI, 2016.
[19] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In ECCV, 2016.
[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, 2012.
[21] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE PAMI, 2015.
[22] S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In ICML, 2015.
[23] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint: 1709.01507, 2017.
[24] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
[25] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE PAMI, 2012.
[26] H. Kiani Galoogahi, A. Fagg, and S. Lucey. Learning background-aware correlation filters for visual tracking. In CVPR, 2017.
[27] A. Kosiorek, A. Bewley, and I. Posner. Hierarchical attentive recurrent tracking. In NIPS, 2017.
[28] M. Kristan et al. The visual object tracking VOT2016 challenge results. In ECCVW, 2016.
[29] Y. Li and J. Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In ECCVW, 2014.
[30] Y. Li, J. Zhu, and S. C. Hoi. Reliable patch trackers: Robust visual tracking by exploiting reliable patches. In CVPR, 2015.
[31] A. Lukezic, T. Vojír, L. C. Zajc, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In CVPR, 2017.
[32] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang. End-to-end active object tracking via reinforcement learning. In ICML, 2018.
[33] W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, X. Zhao, and T.-K. Kim. Multiple object tracking: A literature review. arXiv preprint: 1409.7618, 2014.
[34] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In ICCV, 2015.
[35] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
[36] J. Ning, J. Yang, S. Jiang, L. Zhang, and M.-H. Yang. Object tracking via dual linear structured svm and explicit feature map. In CVPR, 2016.
[37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
[38] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang. Hedged deep tracking. In CVPR, 2016.
[39] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[40] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint: 1312.6034, 2013.
[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[42] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE PAMI, 2014.
[43] Y. Song, C. Ma, L. Gong, J. Zhang, R. W. Lau, and M.-H. Yang. CREST: Convolutional residual learning for visual tracking. In ICCV, 2017.
[44] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, L. Rynson, and M.-H. Yang. VITAL: Visual tracking via adversarial learning. In CVPR, 2018.
[45] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In CVPR, 2016.
[46] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017.
[47] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.
[48] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, 2015.
[49] N. Wang, S. Li, A. Gupta, and D.-Y. Yeung. Transferring rich feature hierarchies for robust visual tracking. arXiv preprint: 1501.04587, 2015.
[50] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In CVPR, 2013.
[51] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. IEEE PAMI, 2015.
[52] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[53] S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Y. Choi. Action-decision networks for visual tracking with deep reinforcement learning. In CVPR, 2017.
[54] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, 2014.
[55] K. Zhang, Q. Liu, Y. Wu, and M.-H. Yang. Robust visual tracking via convolutional networks without training. IEEE TIP, 2016.
[56] L. Zhang, J. Varadarajan, P. N. Suganthan, N. Ahuja, and P. Moulin. Robust visual tracking using oblique random forests. In CVPR, 2017.
[57] T. Zhang, C. Xu, and M.-H. Yang. Multi-task correlation particle filter for robust object tracking. In CVPR, 2017.
[58] G. Zhu, F. Porikli, and H. Li. Beyond local search: Tracking objects everywhere with instance-specific proposals. In CVPR, 2016.
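Appendix (editor's sketch). The forward-backward scheme summarized in the concluding remarks — compute a classification loss in a forward pass, take the partial derivatives with respect to the input sample in a backward pass as an attention map, and couple that map with the classification loss as a regularizer — can be illustrated in a few lines. The snippet below is a minimal NumPy sketch, not the authors' implementation: `TinyClassifier`, `attention_regularizer`, and the ratio-style penalty are hypothetical stand-ins for the paper's CNN classifier and its exact regularization term.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

class TinyClassifier:
    """Toy two-layer binary classifier standing in for the tracking-by-detection CNN."""
    def __init__(self, dim, hidden=8):
        self.W1 = rng.normal(0.0, 0.5, (hidden, dim))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.5, hidden)
        self.b2 = 0.0

    def forward(self, x):
        z1 = self.W1 @ x + self.b1
        h = np.maximum(z1, 0.0)                # ReLU hidden layer
        p = sigmoid(self.w2 @ h + self.b2)     # probability that x is the target
        return p, z1

    def attention(self, x):
        # Backward pass: partial derivative of the target probability with
        # respect to the input sample, used as the attention map.
        p, z1 = self.forward(x)
        return p * (1.0 - p) * (self.W1.T @ (self.w2 * (z1 > 0.0)))

def attention_regularizer(att, target_mask):
    # Hypothetical simplified stand-in for the paper's regularization term:
    # penalize attention energy falling outside the annotated target region
    # relative to the energy concentrated inside it.
    a = np.abs(att)
    return a[~target_mask].mean() / (a[target_mask].mean() + 1e-8)

def reciprocative_loss(model, x, label, target_mask, lam=0.5):
    # Training objective: classification loss plus the attention regularizer.
    p, _ = model.forward(x)
    ce = -(label * np.log(p + 1e-8) + (1.0 - label) * np.log(1.0 - p + 1e-8))
    return ce + lam * attention_regularizer(model.attention(x), target_mask)
```

Minimizing this loss over the classifier weights would drive the input gradients toward the target region, mirroring the role of the regularizer in training; at test time only `forward` is needed, matching the paper's observation that no attention maps are generated during tracking.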