{"title": "Hierarchical Attentive Recurrent Tracking", "book": "Advances in Neural Information Processing Systems", "page_first": 3053, "page_last": 3061, "abstract": "Class-agnostic object tracking is particularly difficult in cluttered environments as target specific discriminative models cannot be learned a priori. Inspired by how the human visual cortex employs spatial attention and separate ``where'' and ``what'' processing pathways to actively suppress irrelevant visual features, this work develops a hierarchical attentive recurrent model for single object tracking in videos. The first layer of attention discards the majority of background by selecting a region containing the object of interest, while the subsequent layers tune in on visual features particular to the tracked object.    This framework is fully differentiable and can be trained in a purely data driven fashion by gradient methods. To improve training convergence, we augment the loss function with terms for auxiliary tasks relevant for tracking. Evaluation of the proposed model is performed on two datasets: pedestrian tracking on the KTH activity recognition dataset and the more difficult KITTI object tracking dataset.", "full_text": "Hierarchical Attentive Recurrent Tracking\n\nAdam R. Kosiorek\n\nDepartment of Engineering Science\n\nUniversity of Oxford\n\nadamk@robots.ox.ac.uk\n\nAlex Bewley\n\nDepartment of Engineering Science\n\nUniversity of Oxford\n\nbewley@robots.ox.ac.uk\n\nIngmar Posner\n\nDepartment of Engineering Science\n\nUniversity of Oxford\n\ningmar@robots.ox.ac.uk\n\nAbstract\n\nClass-agnostic object tracking is particularly dif\ufb01cult in cluttered environments\nas target speci\ufb01c discriminative models cannot be learned a priori. 
Inspired by\nhow the human visual cortex employs spatial attention and separate \u201cwhere\u201d and\n\u201cwhat\u201d processing pathways to actively suppress irrelevant visual features, this\nwork develops a hierarchical attentive recurrent model for single object tracking\nin videos. The first layer of attention discards the majority of background by\nselecting a region containing the object of interest, while the subsequent layers\ntune in on visual features particular to the tracked object. This framework is\nfully differentiable and can be trained in a purely data-driven fashion by gradient\nmethods. To improve training convergence, we augment the loss function with\nterms for auxiliary tasks relevant for tracking. Evaluation of the proposed model is\nperformed on two datasets: pedestrian tracking on the KTH activity recognition\ndataset and the more difficult KITTI object tracking dataset.\n\n1 Introduction\n\nIn computer vision, designing an algorithm for model-free tracking of anonymous objects is challenging, since no target-specific information can be gathered a priori and yet the algorithm has to handle\ntarget appearance changes, varying lighting conditions and occlusion. To make it even more difficult,\nthe tracked object often constitutes but a small fraction of the visual field. The remaining parts may\ncontain distractors, which are visually salient objects resembling the target but hold no relevant\ninformation. Despite this fact, recent models often process the whole image, which exposes them\nto noise and increases the associated computational cost, or they use heuristic methods to decrease\nthe size of search regions. 
This is in contrast to human visual perception, which does not process the\nvisual field in its entirety, but rather acknowledges it briefly and focuses on processing small fractions\nthereof, which we dub visual attention.\nAttention mechanisms have recently been explored in machine learning in a wide variety of contexts\n[27, 14], often providing new capabilities to machine learning algorithms [11, 12, 7]. While they\nimprove efficiency [22] and performance on state-of-the-art machine learning benchmarks [27], their\narchitecture is much simpler than that of the mechanisms found in the human visual cortex [5].\nAttention has also long been studied by neuroscientists [18], who believe that it is crucial for visual\nperception and cognition [4], since it is inherently tied to the architecture of the visual cortex and\ncan affect the information flow inside it. Whenever more than one visual stimulus is present in the\nreceptive field of a neuron, all the stimuli compete for computational resources due to the limited\nprocessing capacity.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: KITTI image with the ground-truth and predicted bounding boxes and an attention glimpse.\nThe lower row corresponds to the hierarchical attention of our model: the 1st layer extracts an attention\nglimpse (a), the 2nd layer uses appearance attention to build a location map (b). The 3rd layer uses\nthe location map to suppress distractors, visualised in (c).\n\nVisual attention can lead to suppression of distractors by reducing the size of the receptive field of a\nneuron and by increasing sensitivity at a given location in the visual field\n(spatial attention). 
It can also amplify activity in different parts of the cortex, which are specialised in\nprocessing different types of features, leading to response enhancement with respect to those features\n(appearance attention). The functional separation of the visual cortex is most apparent in two distinct\nprocessing pathways. After leaving the eye, the sensory inputs enter the primary visual cortex (known\nas V1) and then split into the dorsal stream, responsible for estimating spatial relationships (where),\nand the ventral stream, which targets appearance-based features (what).\nInspired by the general architecture of the human visual cortex and the role of attention mechanisms,\nthis work presents a biologically-inspired recurrent model for single object tracking in videos (cf.\nSection 3). Tracking algorithms typically use simple motion models and heuristics to decrease the size\nof the search region. It is interesting to see whether neuroscientific insights can aid our computational\nefforts, thereby improving the efficiency and performance of single object tracking. It is worth noting\nthat visual attention can be induced by the stimulus itself (due to, e. g., high contrast) in a bottom-up\nfashion or by back-projections from other brain regions and working memory as top-down influence.\nThe proposed approach exploits this property to create a feedback loop that steers the three layers of\nvisual attention mechanisms in our hierarchical attentive recurrent tracking (HART) framework; see\nFigure 1. The first stage immediately discards spatially irrelevant input, while later stages focus on\nproducing target-specific filters to emphasise visual features particular to the object of interest.\nThe resulting framework is end-to-end trainable and we resort to maximum likelihood estimation\n(MLE) for parameter learning. 
This follows from our interest in estimating the distribution over object\nlocations in a sequence of images, given the initial location whence our tracking commenced.\nFormally, given a sequence of images x1:T \u2208 R^{H\u00d7W\u00d7C}, where the superscript denotes height, width\nand the number of channels of the image, respectively, and an initial location for the tracked object\ngiven by a bounding box b1 \u2208 R^4, the conditional probability distribution factorises as\n\np(b2:T | x1:T , b1) = \u222b\u222b p(h1 | x1, b1) \u220f_{t=2}^{T} p(bt | ht) p(ht | xt, bt\u22121, ht\u22121) dht dh1,    (1)\n\nwhere we assume that the motion of an object can be described by a Markovian state ht. Our bounding\nbox estimates are given by b\u03022:T , found by the MLE of the model parameters. In sum, our contributions\nare threefold: Firstly, a hierarchy of attention mechanisms that leads to suppressing distractors and\ncomputational efficiency is introduced. Secondly, a biologically plausible combination of attention\nmechanisms and recurrent neural networks is presented for object tracking. Finally, our attention-based tracker is demonstrated using real-world sequences in challenging scenarios where previous\nrecurrent attentive trackers have failed.\nNext we briefly review related work (Section 2) before describing how information flows through the\ncomponents of our hierarchical attention in Section 3. Section 4 details the losses applied to guide\nthe attention. Section 5 presents experiments on KTH and KITTI datasets with comparison to related\nattention-based trackers. Section 6 discusses the results and intriguing properties of our framework\nand Section 7 concludes the work. Code and results are available online1.\n\n1https://github.com/akosiorek/hart\n\n\f2 Related Work\nA number of recent studies have demonstrated that visual content can be captured through a sequence\nof spatial glimpses or foveation [22, 12]. 
Such a paradigm has the intriguing property that the\ncomputational complexity is proportional to the number of steps as opposed to the image size.\nFurthermore, the fovea centralis in the retina of primates is structured with maximum visual acuity\nin the centre and decaying resolution towards the periphery. Cheung et al. [4] show that if spatial\nattention is capable of zooming, a regular grid sampling is sufficient. Jaderberg et al. [14] introduced\nthe spatial transformer network (STN), which provides a fully differentiable means of transforming\nfeature maps, conditioned on the input itself. Eslami et al. [7] use the STN as a form of attention\nin combination with a recurrent neural network (RNN) to sequentially locate and identify objects\nin an image. Moreover, Eslami et al. [7] use a latent variable to estimate the presence of additional\nobjects, allowing the RNN to adapt the number of time-steps based on the input. Our spatial\nattention mechanism is based on the two-dimensional Gaussian grid filters of [16], which are both fully\ndifferentiable and more biologically plausible than the STN.\nWhilst focusing on a specific location has its merits, focusing on particular appearance features might\nbe as important. A policy with feedback connections can learn to adjust filters of a convolutional\nneural network (CNN), thereby adapting them to features present in the current image and improving\naccuracy [25]. De Brabandere et al. [6] introduced the dynamic filter network (DFN), where filters for a\nCNN are computed on-the-fly conditioned on input features, which can reduce model size without\nperformance loss. Karl et al. [17] showed that input-dependent state transitions can be helpful\nfor learning a latent Markovian state-space system. 
While not the focus of this work, we follow this\nconcept in estimating the expected appearance of the tracked object.\nIn the context of single object tracking, both attention mechanisms and RNNs appear to be perfectly\nsuited, yet their success has mostly been limited to simple monochromatic sequences with plain\nbackgrounds [16]. Cheung [3] applied STNs [14] as attention mechanisms for real-world object\ntracking, but failed due to exploding gradients potentially arising from the difficulty of the data. Ning\net al. [23] achieved competitive performance by using features from an object detector as inputs to a\nlong short-term memory (LSTM) network, but their method requires processing of the whole image at each time-step.\nTwo recent state-of-the-art trackers employ convolutional Siamese networks, which can be seen as\nan RNN unrolled over two time-steps [13, 26]. Both methods explicitly process small search areas\naround the previous target position to produce a bounding box offset [13] or a correlation response\nmap with the maximum corresponding to the target position [26]. We acknowledge the recent work2\nof Gordon et al. [10], which employs an RNN-based model and uses explicit cropping and warping as a\nform of non-differentiable spatial attention. The work presented in this paper is closest to [16], with which\nwe share a similar spatial attention mechanism that is guided through an RNN to effectively learn\na motion model that spans multiple time-steps. The next section describes our additional attention\nmechanisms in relation to their biological counterparts.\n3 Hierarchical Attention\n\nFigure 2: Hierarchical Attentive Recurrent Tracking. Spatial attention extracts a glimpse gt from\nthe input image xt. 
V1 and the ventral stream extract appearance-based features \u03bdt, while the dorsal\nstream computes a foreground/background segmentation st of the attention glimpse. Masked features\nvt contribute to the working memory ht. The LSTM output ot is then used to compute attention\nat+1, appearance \u03b1t+1 and a bounding box correction \u2206b\u0302t. Dashed lines correspond to temporal\nconnections, while solid lines describe information flow within one time-step.\n\n2[10] only became available at the time of submitting this paper.\n\n\fFigure 3: Architecture of the appearance attention. V1 is implemented as a CNN shared among the dorsal stream (DFN)\nand the ventral stream (CNN). The \u2299 symbol represents\nthe Hadamard product and implements masking of visual\nfeatures by the foreground/background segmentation.\n\nInspired by the architecture of the human visual cortex, we structure our system around a working\nmemory responsible for storing the motion pattern and an appearance description of the tracked object.\nIf both quantities were known, it would be possible to compute the expected location of the object at\nthe next time step. Given a new frame, however, it is not immediately apparent which visual features\ncorrespond to the appearance description. If we were to pass them on to an RNN, it would have to\nimplicitly solve a data association problem. As it is non-trivial, we prefer to model it explicitly by\noutsourcing the computation to a separate processing stream conditioned on the expected appearance.\nThis results in a location map, making it possible to neglect features inconsistent with our memory of\nthe tracked object. We now proceed with describing the information flow in our model.\nGiven attention parameters at, the spatial attention module extracts a glimpse gt from the input\nimage xt. 
We then apply appearance attention, parametrised by appearance \u03b1t and comprising\nV1 and the dorsal and ventral streams, to obtain object-specific features vt, which are used to update\nthe hidden state ht of an LSTM. The LSTM\u2019s output is then decoded to predict both spatial and\nappearance attention parameters for the next time-step along with a bounding box correction \u2206b\u0302t for\nthe current time-step. Spatial attention is driven by the top-down signal at, while appearance attention\ndepends on top-down \u03b1t as well as bottom-up (contents of the glimpse gt) signals. Bottom-up\nsignals have local influence and depend on stimulus salience at a given location, while top-down\nsignals incorporate global context into local processing. This attention hierarchy, further enhanced by\nrecurrent connections, mimics that of the human visual cortex [18]. We now describe the individual\ncomponents of the system.\n\nSpatial Attention Our spatial attention mechanism is similar to the one used by Kaho\u00fa et al. [16].\nGiven an input image xt \u2208 R^{H\u00d7W}, it creates two matrices A^x_t \u2208 R^{w\u00d7W} and A^y_t \u2208 R^{h\u00d7H},\nrespectively. Each matrix contains one Gaussian per row; the width and positions of the Gaussians\ndetermine which parts of the image are extracted as the attention glimpse. Formally, the glimpse\ngt \u2208 R^{h\u00d7w} is defined as\n\ngt = A^y_t xt (A^x_t)^T.    (2)\n\nAttention is described by centres \u00b5 of the Gaussians, their variances \u03c3^2 and strides \u03b3 between\ncentres of Gaussians of consecutive rows of the matrix, one for each axis. In contrast to the work\nby Kaho\u00fa et al. [16], only centres and strides are estimated from the hidden state of the LSTM,\nwhile the variance depends solely on the stride. This prevents the excessive aliasing caused when a\nvariance that is small compared to the stride is predicted, leading to smoother convergence. 
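As a concrete illustration of Equation (2), the Gaussian attention matrices and the glimpse extraction can be sketched in NumPy as below; the image size, glimpse size and attention parameters (`mu`, `gamma`, `sigma`) are arbitrary values chosen for the example, not those used in the paper.

```python
import numpy as np

def attention_matrix(mu, gamma, sigma, n_out, n_in):
    """Build an attention matrix with one Gaussian per row (row i centred
    at mu + i * gamma), normalised so that each row sums to one."""
    centres = mu + np.arange(n_out) * gamma              # (n_out,)
    grid = np.arange(n_in)[None, :]                      # (1, n_in)
    A = np.exp(-0.5 * ((grid - centres[:, None]) / sigma) ** 2)
    return A / (A.sum(axis=1, keepdims=True) + 1e-8)

def extract_glimpse(x, a_y, a_x):
    """g_t = A^y_t x_t (A^x_t)^T  -- Equation (2)."""
    return a_y @ x @ a_x.T

# a 120 x 160 "image" and a 28 x 28 glimpse centred around (30, 50)
x = np.random.rand(120, 160)
A_y = attention_matrix(mu=30.0, gamma=1.5, sigma=1.0, n_out=28, n_in=120)
A_x = attention_matrix(mu=50.0, gamma=1.5, sigma=1.0, n_out=28, n_in=160)
g = extract_glimpse(x, A_y, A_x)
print(g.shape)  # (28, 28)
```

Because each row of an attention matrix is a normalised Gaussian, the glimpse is a smooth, differentiable crop-and-resample of the image, which is what allows training by gradient methods.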
The\nrelationship between variance and stride is approximated using linear regression with polynomial\nbasis functions (up to 4th order) before training the whole system. The glimpse size we use depends\non the experiment.\nAppearance Attention This stage transforms the attention glimpse gt into a fixed-dimensional\nvector vt comprising appearance and spatial information about the tracked object. Its architecture\ndepends on the experiment. In general, however, we implement V1 : R^{h\u00d7w} \u2192 R^{hv\u00d7wv\u00d7cv} as a\nnumber of convolutional and max-pooling layers. They are shared among later processing stages,\nwhich corresponds to the primary visual cortex in humans [5]. Processing then splits into ventral and\ndorsal streams. The ventral stream is implemented as a CNN; it handles visual features and outputs\nfeature maps \u03bdt. The dorsal stream, implemented as a DFN, is responsible for handling spatial\nrelationships. Let MLP(\u00b7) denote a multi-layered perceptron. The dorsal stream uses appearance \u03b1t\nto dynamically compute convolutional filters \u03c8^{a\u00d7b\u00d7c\u00d7d}_t, where the superscript denotes the size of\nthe filters and the number of input and output feature maps, as\n\n\u03a8t = {\u03c8^{ai\u00d7bi\u00d7ci\u00d7di}_t}_{i=1}^{K} = MLP(\u03b1t).    (3)\n\nThe filters with corresponding nonlinearities form K convolutional layers applied to the output of V1.\nFinally, a convolutional layer with a 1 \u00d7 1 kernel and a sigmoid non-linearity is applied to transform\nthe output into a spatial Bernoulli distribution st. Each value in st represents the probability of the\ntracked object occupying the corresponding location.\n\n\fThe location map of the dorsal stream is combined with appearance-based features extracted by the\nventral stream, to imitate the distractor-suppressing behaviour of the human brain. 
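To make Equation (3) concrete, the sketch below uses NumPy to map an appearance vector to a single dynamically computed 3 \u00d7 3 filter, convolve a V1 feature map with it, and squash the response into a location map. The layer sizes, the use of a single dynamic filter (K = 1) and the plain-Python convolution are simplifying assumptions for illustration, not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(alpha, w1, w2):
    """Tiny two-layer perceptron emitting a flat filter vector, MLP(alpha_t)."""
    return np.tanh(alpha @ w1) @ w2

def conv2d_same(x, k):
    """'Same'-padded 2-D cross-correlation of a single-channel map with kernel k."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

# appearance vector alpha_t -> dynamic 3x3 filter psi_t (Eq. (3))
alpha = rng.standard_normal(8)
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 9))
psi = mlp(alpha, w1, w2).reshape(3, 3)

# apply the dynamic filter to a V1 feature map; a sigmoid then yields
# the spatial Bernoulli location map s_t
v1_map = rng.standard_normal((14, 14))
location_map = sigmoid(conv2d_same(v1_map, psi))
print(location_map.shape)  # (14, 14)
```

Because the filter values are themselves an output of the MLP, gradients flow through \u03c8 back into the appearance parameters \u03b1t, which is what lets the tracker adapt its filters to the current target.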
This masking also prevents\ndrift and allows occlusion handling, since object appearance is not overwritten in the hidden state\nwhen the input does not contain features particular to the tracked object. Outputs of both streams are\ncombined as3\n\nvt = MLP(vec(\u03bdt \u2299 st)),    (4)\n\nwith \u2299 being the Hadamard product.\nState Estimation Our approach relies on being able to predict future object appearance and location,\nand therefore it heavily depends on state estimation. We use an LSTM, which can learn to trade off\nspatio-temporal and appearance information in a data-driven fashion. It acts like a working memory,\nenabling the system to be robust to occlusions and oscillating object appearance, e. g., when an object\nrotates and comes back to the original orientation.\n\not, ht = LSTM(vt, ht\u22121),    (5)\n\u03b1t+1, \u2206at+1, \u2206b\u0302t = MLP(ot, vec(st)),    (6)\nat+1 = at + tanh(c)\u2206at+1,    (7)\nb\u0302t = at + \u2206b\u0302t.    (8)\n\nEquations (5) to (8) detail the state updates. Spatial attention at time t is formed as a cumulative sum\nof attention updates from times t = 1 to t = T , where c is a learnable parameter initialised to a small\nvalue to constrain the size of the updates early in training. Since the spatial-attention mechanism\nis trained to predict where the object is going to go (Section 4), the bounding box b\u0302t is estimated\nrelative to the attention at time t.\n\n4 Loss\nWe train our system by minimising a loss function comprising a tracking loss term, a set of terms\nfor auxiliary tasks and regularisation terms. Auxiliary tasks are essential for real-world data, since\nconvergence does not occur without them. They also speed up learning and lead to better performance\nfor simpler datasets. Unlike the auxiliary tasks used by Jaderberg et al. [15], ours are relevant for our\nmain objective \u2014 object tracking. In order to limit the number of hyperparameters, we automatically\nlearn the loss weighting. 
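Before turning to the individual loss terms, the per-time-step state updates of Equations (4) to (8) can be sketched end-to-end; the dimensionalities, the single-matrix LSTM cell and the linear read-outs standing in for the MLPs are illustrative assumptions, not the sizes used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(v, h, c, W):
    """Minimal LSTM cell; W maps [v, h] to the four gate pre-activations."""
    d = h.size
    z = np.concatenate([v, h]) @ W
    i, f = sigmoid(z[:d]), sigmoid(z[d:2 * d])
    o, g = sigmoid(z[2 * d:3 * d]), np.tanh(z[3 * d:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

d = 16
nu = rng.standard_normal((5, 5, 3))            # ventral feature maps nu_t
s = sigmoid(rng.standard_normal((5, 5, 1)))    # location map s_t
v = (nu * s).reshape(-1)                       # Eq. (4), with an identity MLP

W_lstm = 0.1 * rng.standard_normal((v.size + d, 4 * d))
h, c_mem = np.zeros(d), np.zeros(d)
h, c_mem = lstm_cell(v, h, c_mem, W_lstm)      # Eq. (5); h doubles as o_t here

# decode the output into appearance, attention and bounding box updates
W_out = 0.1 * rng.standard_normal((d + s.size, 8 + 4 + 4))
out = np.concatenate([h, s.reshape(-1)]) @ W_out          # Eq. (6)
alpha_next, delta_a, delta_b = out[:8], out[8:12], out[12:]

c_scale = 0.01                                  # learnable scalar, initialised small
a_t = np.array([40.0, 40.0, 30.0, 30.0])        # current spatial attention
a_next = a_t + np.tanh(c_scale) * delta_a       # Eq. (7)
b_hat = a_t + delta_b                           # Eq. (8)
print(b_hat.shape)  # (4,)
```

The small initial value of the scalar c keeps the attention updates tiny early in training, mirroring the constraint described above.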
The loss L(\u00b7) is given by\n\nLHART(D, \u03b8) = \u03bbtLt(D, \u03b8) + \u03bbsLs(D, \u03b8) + \u03bbaLa(D, \u03b8) + R(\u03bb) + \u03b2R(D, \u03b8),    (9)\n\nwith dataset D = {(x1:T , b1:T )^i}_{i=1}^{M}, network parameters \u03b8, regularisation terms R(\u00b7), adaptive\nweights \u03bb = {\u03bbt, \u03bbs, \u03bba} and a regularisation weight \u03b2. We now present and justify components of\nour loss, where expectations E[\u00b7] are evaluated as an empirical mean over a minibatch of samples\n{x^i_{1:T}, b^i_{1:T}}_{i=1}^{M}, where M is the batch size.\n\nTracking To achieve the main tracking objective (localising the object in the current frame), we\nbase the first loss term on the Intersection-over-Union (IoU) of the predicted bounding box w. r. t. the\nground truth, where the IoU of two bounding boxes is defined as IoU(a, b) = (a \u2229 b)/(a \u222a b) = area of overlap / area of union.\nThe IoU is invariant to object and image scale, making it a suitable proxy for measuring the quality\nof localisation. Even though it (or an exponential thereof) does not correspond to any probability\ndistribution (as it cannot be normalised), it is often used for evaluation [20]. We follow the work by\nYu et al. [28] and express the loss term as the negative log of IoU:\n\nLt(D, \u03b8) = E_{p(b\u03021:T | x1:T , b1)}[\u2212 log IoU(b\u0302t, bt)],    (10)\n\nwith IoU clipped for numerical stability.\n\n3vec : R^{m\u00d7n} \u2192 R^{mn} is the vectorisation operator, which stacks columns of a matrix into a column vector.\n\n\fFigure 4: Tracking results on the KTH dataset [24]. Starting with the first initialisation frame where all\nthree boxes overlap exactly, time flows from left to right showing every 16th frame of the sequence\ncaptured at 25 fps. The colour coding follows from Figure 1. 
The second row shows attention glimpses\nmultiplied with appearance attention.\nSpatial Attention Spatial attention singles out the tracked object from the image. To estimate its\nparameters, the system has to predict the object\u2019s motion. In case of an error, especially when the\nattention glimpse does not contain the tracked object, it is difficult to recover. As the probability of\nsuch an event increases with decreasing size of the glimpse, we employ two loss terms. The first one\nconstrains the predicted attention to cover the bounding box, while the second one prevents it from\nbecoming too large, where the logarithmic arguments are appropriately clipped to avoid numerical\ninstabilities:\n\nLs(D, \u03b8) = E_{p(a1:T | x1:T , b1)}[\u2212 log((at \u2229 bt)/area(bt)) \u2212 log(1 \u2212 IoU(at, xt))].    (11)\n\nAppearance Attention The purpose of appearance attention is to suppress distractors while keeping\na full view of the tracked object, e. g., focus on a particular pedestrian moving within a group. To guide\nthis behaviour, we put a loss on appearance attention that encourages picking out only the tracked\nobject. Let \u03c4 (at, bt) : R^4 \u00d7 R^4 \u2192 {0, 1}^{hv\u00d7wv} be a target function. Given the bounding box b and\nattention a, it outputs a binary mask of the same size as the output of V1. The mask corresponds to\nthe glimpse g, with the value equal to one at every location where the bounding box overlaps\nwith the glimpse and equal to zero otherwise. If we take H(p, q) = \u2212\u2211_z p(z) log q(z) to be the\ncross-entropy, the loss reads\n\nLa(D, \u03b8) = E_{p(a1:T , s1:T | x1:T , b1)}[H(\u03c4 (at, bt), st)].    (12)\n\nRegularisation We apply L2 regularisation to the model parameters \u03b8 and to the expected value\nof the dynamic parameters \u03c8t(\u03b1t) as R(D, \u03b8) = 1/2 \u2016\u03b8\u2016^2_2 + 1/2 \u2016E_{p(\u03b11:T | x1:T , b1)}[\u03a8t | \u03b1t]\u2016^2_2.\nAdaptive Loss Weights To avoid hyper-parameter tuning, we follow the work by Kendall et al. [19]\nand learn the loss weighting \u03bb. After initialising the weights with a vector of ones, we add the\nfollowing regularisation term to the loss function: R(\u03bb) = \u2212\u2211_i log(\u03bb_i^{\u22121}).\n\n5 Experiments\n5.1 KTH Pedestrian Tracking\n\nKaho\u00fa et al. [16] performed a pedestrian tracking experiment on the KTH activity recognition dataset\n[24] as a real-world case study. We replicate this experiment for comparison. We use code provided\nby the authors for data preparation and we also use their pre-trained feature extractor. Unlike them,\nwe did not need to upscale ground-truth bounding boxes by a factor of 1.5 and then downscale\nthem again for evaluation. We follow the authors and set the glimpse size (h, w) = (28, 28). We\nreplicate the training procedure exactly, with the exception of using the RMSProp optimiser [9] with\na learning rate of 3.33 \u00d7 10\u22125 and momentum set to 0.9 instead of stochastic gradient descent\nwith momentum. The original work reported an IoU of 55.03% on average on test data, while the\npresented work achieves an average IoU score of 77.11%, reducing the relative error by almost a\nfactor of two. Figure 4 presents qualitative results.\n\n5.2 Scaling to Real-World Data: KITTI\n\nSince we demonstrated that pedestrian tracking is feasible using the proposed architecture, we proceed\nto evaluate our model in a more challenging multi-class scenario on the KITTI dataset [8]. It consists\nof 21 high resolution video sequences with multiple instances of the same class posing as potential\ndistractors. We split all sequences into 80/20 sequences for train and test sets, respectively. As images\nin this dataset are much more varied, we implement V1 as the first three convolutional layers of a\nmodified AlexNet [1]. The original AlexNet takes inputs of size 227 \u00d7 227 and downsizes them to\n14 \u00d7 14 after the conv3 layer. Since too low a resolution would result in low tracking performance, and\nwe did not want to upsample the extracted glimpse, we decided to replace the initial stride of four\nwith one and to skip one of the max-pooling operations to conserve spatial dimensions. This way,\nour feature map has a size of 14 \u00d7 14 \u00d7 384 with an input glimpse of size (h, w) = (56, 56). We\napply dropout with probability 0.25 at the end of V1. The ventral stream consists of a single\nconvolutional layer with a 1 \u00d7 1 kernel and five output feature maps. The dorsal stream has two\ndynamic filter layers with kernels of size 1 \u00d7 1 and 3 \u00d7 3, respectively, and five feature maps each. We\nused 100 hidden units in the RNN with orthogonal initialisation and Zoneout [21] with the probability\nset to 0.05. The system was trained via curriculum learning [2], by starting with sequences of length\nfive and increasing the sequence length every 13 epochs, with epoch length decreasing as sequence\nlength increases. We used the same optimisation settings, with the exception of the learning rate,\nwhich we set to 3.33 \u00d7 10\u22126.\n\n\fTable 1: Average IoU on KITTI over 60 time-steps.\nMethod               Avg. IoU\nKaho\u00fa et al. [16]    0.14\nSpatial Att          0.60\nApp Att              0.78\nHART                 0.81\n\nFigure 5: IoU curves on KITTI over 60 timesteps. HART (train) presents evaluation on the train set\n(we do not overfit).\n\nTable 1 and Figure 5 contain results of different variants of our model and of the RATM tracker by\nKaho\u00fa et al. [16]. Spatial Att does not use appearance attention, nor a loss on attention\nparameters. 
App Att does not apply any loss on appearance attention, while HART uses all described\nmodules; it is also our biggest model, with 1.8 million parameters. Qualitative results in the form of a\nvideo with bounding boxes and attention are available online4. We implemented the RATM tracker\nof Kaho\u00fa et al. [16] and trained it with the same hyperparameters as our framework, since both are\nclosely related. It failed to learn even with the initial curriculum of five time-steps, as RATM cannot\nintegrate the frame xt into the estimate of bt (it predicts the location at the next time-step). Furthermore,\nit uses the feature-space distance between ground-truth and predicted attention glimpses as the error\nmeasure, which is insufficient on a dataset with rich backgrounds. It did better when we initialised its\nfeature extractor with the weights of our trained model but, despite passing a few stages of the curriculum,\nit achieved very poor final performance.\n6 Discussion\nThe experiments in the previous section show that it is possible to track real-world objects with\na recurrent attentive tracker. While similar to the tracker by Kaho\u00fa et al. [16], our approach uses\nadditional building blocks, specifically: (i) a bounding-box regression loss, (ii) a loss on spatial attention,\n(iii) appearance attention with an additional loss term, and (iv) a combination of all of these in a unified\napproach. We now discuss properties of these modules.\n\nSpatial Attention Loss prevents Vanishing Gradients Our early experiments suggest that using\nonly the tracking loss causes an instance of the vanishing gradient problem. Early in training, the\nsystem is not able to estimate the object\u2019s motion correctly, leading to cases where the extracted glimpse\ndoes not contain the tracked object or contains only a part thereof. In such cases, the supervisory\nsignal is only weakly correlated with the model\u2019s input, which prevents learning. 
Even when the\nobject is contained within the glimpse, the gradient path from the loss function is rather long, since\nany teaching signal has to pass to the previous timestep through the feature extractor stage. Penalising\nattention parameters directly seems to solve this issue.\n\n4https://youtu.be/Vvkjm0FRGSs\n\n\f(a) The model with appearance attention loss (top) learns to focus on the tracked object, which prevents an ID\nswap when a pedestrian is occluded by another one (bottom).\n\n(b) Three examples of glimpses and location maps for a model with and without appearance loss (left to right).\nAttention loss forces the appearance attention to pick out only the tracked object, thereby suppressing distractors.\nFigure 6: Glimpses and corresponding location maps for models trained with and without appearance\nloss. The appearance loss encourages the model to learn foreground/background segmentation of the\ninput glimpse.\n\nIs Appearance Attention Loss Necessary? Given enough data and sufficiently high model capacity,\nappearance attention should be able to filter out irrelevant input features before updating the working\nmemory. In general, however, this behaviour can be achieved faster if the model is constrained to do\nso by using an appropriate loss. Figure 6 shows examples of glimpses and corresponding location\nmaps for a model with and without a loss on the appearance attention. In Figure 6a the model with a loss\non appearance attention is able to track a pedestrian even after it was occluded by another human.\nFigure 6b shows that, when not penalised, the location map might not be very object-specific and can\nmiss the object entirely (right-most figure). 
By using the appearance attention loss, we not only\nimprove results but also make the model more interpretable.\nSpatial Attention Bias is Always Positive To condition the system on the object\u2019s appearance and\nmake it independent of the starting location, we translate the initial bounding box to attention parameters,\nto which we add a learnable bias, and create the hidden state of the LSTM from the corresponding visual\nfeatures. In our experiments, this bias always converged to positive values, favouring an attention glimpse\nslightly larger than the object bounding box. This suggests that, while discarding irrelevant features is\ndesirable for object tracking, the system as a whole learns to trade off attention responsibility between\nthe spatial and appearance-based attention modules.\n\n7 Conclusion\nInspired by the cascaded attention mechanisms found in the human visual cortex, this work presented\na neural attentive recurrent tracking architecture suited for the task of object tracking. Beyond\nthe biological inspiration, the proposed approach has a desirable computational cost and increased\ninterpretability due to location maps, which select features essential for tracking. Furthermore, by\nintroducing a set of auxiliary losses we are able to scale to challenging real-world data, outperforming\npredecessor attempts and approaching state-of-the-art performance. Future research will look into\nextending the proposed approach to multi-object tracking, as, unlike many single object trackers, the\nrecurrent nature of the proposed tracker offers the ability to attend to each object in turn.\nAcknowledgements\nWe would like to thank Oiwi Parker Jones and Martin Engelcke for discussions and valuable insights\nand Neil Dhir for his help with editing the paper. Additionally, we would like to acknowledge the\nsupport of the UK\u2019s Engineering and Physical Sciences Research Council (EPSRC) through the\nProgramme Grant EP/M019918/1 and the Doctoral Training Award (DTA). 
The donation from Nvidia of the Titan Xp GPU used in this work is also gratefully acknowledged.