{"title": "Deep Neural Networks with Inexact Matching for Person Re-Identification", "book": "Advances in Neural Information Processing Systems", "page_first": 2667, "page_last": 2675, "abstract": "Person Re-Identification is the task of matching images of a person across multiple camera views. Almost all prior approaches address this challenge by attempting to learn the possible transformations that relate the different views of a person from a training corpora. Then, they utilize these transformation patterns for matching a query image to those in a gallery image bank at test time. This necessitates learning good feature representations of the images and having a robust feature matching technique. Deep learning approaches, such as Convolutional Neural Networks (CNN), simultaneously do both and have shown great promise recently. In this work, we propose two CNN-based architectures for Person Re-Identification. In the first, given a pair of images, we extract feature maps from these images via multiple stages of convolution and pooling. A novel inexact matching technique then matches pixels in the first representation with those of the second. Furthermore, we search across a wider region in the second representation for matching. Our novel matching technique allows us to tackle the challenges posed by large viewpoint variations, illumination changes or partial occlusions. Our approach shows a promising performance and requires only about half the parameters as a current state-of-the-art technique. Nonetheless, it also suffers from false matches at times. In order to mitigate this issue, we propose a fused architecture that combines our inexact matching pipeline with a state-of-the-art exact matching technique. 
We observe substantial gains with the fused model over the current state-of-the-art on multiple challenging datasets of varying sizes, with gains of up to about 21%.", "full_text": "Deep Neural Networks with Inexact Matching for Person Re-Identification\n\nArulkumar Subramaniam\n\nIndian Institute of Technology Madras\n\nChennai, India 600036\n\naruls@cse.iitm.ac.in\n\nMoitreya Chatterjee\n\nIndian Institute of Technology Madras\n\nChennai, India 600036\n\nmetro.smiles@gmail.com\n\nAnurag Mittal\n\nIndian Institute of Technology Madras\n\nChennai, India 600036\n\namittal@cse.iitm.ac.in\n\nAbstract\n\nPerson Re-Identification is the task of matching images of a person across multiple camera views. Almost all prior approaches address this challenge by attempting to learn the possible transformations that relate the different views of a person from a training corpus. Then, they utilize these transformation patterns for matching a query image to those in a gallery image bank at test time. This necessitates learning good feature representations of the images and having a robust feature matching technique. Deep learning approaches, such as Convolutional Neural Networks (CNN), simultaneously do both and have shown great promise recently. In this work, we propose two CNN-based architectures for Person Re-Identification. In the first, given a pair of images, we extract feature maps from these images via multiple stages of convolution and pooling. A novel inexact matching technique then matches pixels in the first representation with those of the second. Furthermore, we search across a wider region in the second representation for matching. Our novel matching technique allows us to tackle the challenges posed by large viewpoint variations, illumination changes or partial occlusions. Our approach shows promising performance and requires only about half as many parameters as a current state-of-the-art technique. 
Nonetheless, it also suffers from false matches at times. In order to mitigate this issue, we propose a fused architecture that combines our inexact matching pipeline with a state-of-the-art exact matching technique. We observe substantial gains with the fused model over the current state-of-the-art on multiple challenging datasets of varying sizes, with gains of up to about 21%.\n\n1 Introduction\n\nSuccessful object recognition systems, such as Convolutional Neural Networks (CNN), extract “distinctive patterns” that describe an object (e.g. a human) in an image, when “shown” several images known to contain that object, exploiting Machine Learning techniques [1]. Through successive stages of convolutions and a host of non-linear operations such as pooling, non-linear activation, etc., CNNs extract complex yet discriminative representations of objects that are then classified into categories using a classifier, such as softmax.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Some common challenges in Person Re-Identification.\n\n1.1 Problem Definition\n\nOne of the key subproblems of the generic object recognition task is recognizing people. Of special interest to the surveillance and Human-Computer Interaction (HCI) community is the task of identifying a particular person across multiple images captured from the same/different cameras, from different viewpoints, at the same/different points in time. This task is also known as Person Re-Identification. Given a pair of images, such systems should be able to decipher whether both of them contain the same person or not. This is a challenging task, since the appearance of a person across images can be very different due to large viewpoint changes, partial occlusions, illumination variations, etc. Figure 1 highlights some of these challenges. 
This also leads to a fundamental research question: Can CNN-based approaches effectively handle the challenges related to Person Re-Identification? CNN-based approaches have recently been applied to this task with reasonable success [2, 3] and are amongst the most competitive approaches for it. Inspired by such approaches, we explore a set of novel CNN-based architectures for this task. We treat the problem as a classification task. During training, for every pair of images, the model is told whether they are from the same person or not. At test time, the posterior classification probabilities obtained from the models are used to rank the images in a gallery image set in terms of their similarity to a query image (probe).\nIn this work, we propose two novel CNN-based schemes for Person Re-Identification. Our first model hinges on the key observation that due to a wide viewpoint variation, the task of finding a match between the pixels of a pair of images needs to be carried out over a larger region, since “matching pixels” on the object may have been displaced significantly. Secondly, illumination variations might cause the absolute intensities of image regions to vary, rendering exact matching approaches ineffective. Finally, coupling these two solutions might provide a recipe for taking care of partial occlusions as well. We call this first model Normalized X-Corr. However, the flexibility of inexact (soft) matching over a wider search space comes at the cost of occasional false matches. To remedy this, we propose a second CNN-based model which fuses a state-of-the-art exact matching technique [2] with Normalized X-Corr. We hypothesize that proper training allows the two components of the fused network to learn complementary patterns from the data, thus aiding the final classification. 
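The test-time ranking described above can be sketched in a few lines; this is an illustrative sketch of ours, not code from the paper, and the function names and toy probabilities are hypothetical:

```python
import numpy as np

def rank_gallery(probe_same_probs):
    """Given the model's P(same person) for a probe against each gallery
    image, return gallery indices sorted from best to worst match."""
    return np.argsort(-np.asarray(probe_same_probs))

def cmc_rank(probe_same_probs, true_index):
    """1-based rank at which the ground-truth gallery image is retrieved."""
    ranking = rank_gallery(probe_same_probs)
    return int(np.where(ranking == true_index)[0][0]) + 1

# Toy example: 4 gallery images, ground-truth match at index 2.
probs = [0.10, 0.30, 0.85, 0.40]
print(rank_gallery(probs).tolist())     # [2, 3, 1, 0]
print(cmc_rank(probs, true_index=2))    # 1
```

Averaging the indicator "rank at most k" over many probes yields the Cumulative Matching Characteristics (CMC) accuracies reported later in the paper.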
Empirical results show that Normalized X-Corr holds promise and that the Fused network outperforms all baseline approaches on multiple challenging datasets, with gains of up to 21% over the baselines.\nIn the next section, we touch upon relevant prior work. We present our methodology in Section 3. Sections 4 and 5 present the experiments and discuss the results obtained. Finally, we conclude in Section 6, outlining some avenues worthy of exploration in the future.\n\n2 Related Work\n\nIn broad terms, we categorize the prior work in this field into Non-Deep and Deep Learning approaches.\n\n2.1 Non-Deep Learning Approaches\n\nPerson Re-Identification systems have two main components: Feature Extraction and a Similarity Metric for matching. Traditional approaches for Person Re-Identification either proposed useful features, or discriminative similarity metrics for comparison, or both.\nTypical features that have proven useful include color histograms and dense SIFT [4] features computed over local patches in the image [5, 6]. Farenzena et al. represent image patches by exploiting features that model appearance, chromatic content, etc. [7]. Another interesting line of work on feature representation attempts to learn a bag-of-features (a.k.a. dictionary)-based approach for image representation [8, 9]. Further, Prosser et al. show the effectiveness of learning a subspace for representing the data, modeled using a set of standard features [10]. While these approaches show promise, their performance is bounded by the ability to engineer good features. Our models, based on deep learning, overcome this handicap by learning a representation from the data.\nThere is also a substantial body of work that attempts to learn an effective similarity metric for comparing images [11, 12, 13, 14, 15, 16]. Here, the objective is to learn a distance measure that is indicative of the similarity of the images. 
The Mahalanobis distance has been the most commonly adopted metric for matching in person re-identification [5, 17, 18, 19]. Some other metric learning approaches attempt to learn transformations which, when applied to the feature space, align similar images more closely [20]. Yet other successful metric learning approaches use an ensemble of multiple metrics [21]. In contrast to these approaches, we jointly learn both the features and a discriminative metric (using a classifier) in a deep learning framework.\nAnother interesting line of non-deep approaches for person re-identification has claimed novelty both in terms of features as well as matching metrics [22, 23]. Many of them rely on weighting the hand-engineered image features first, based on some measure such as saliency, and then performing matching [6, 24, 25]. However, this is done in a non-deep framework, unlike ours.\n\n2.2 Deep Learning based Approaches\n\nThere has been relatively little prior work based on deep learning addressing the challenge of Person Re-Identification. Most deep learning approaches exploit the CNN framework for the task, i.e. they first extract highly non-linear representations from the images, then they compute some measure of similarity. Yi et al. propose a Siamese Network that takes as input the image pair that is to be compared, performs 3 stages of convolution on them (with the kernels sharing weights), and finally uses cosine similarity to judge the extent of their match [26]. Both of our models differ by performing a novel inexact matching of the images after two stages of convolution and then processing the output of the matching layer to arrive at a decision.\nLi et al. also adopt a two-input network architecture [3]. They take the product of the responses obtained right after the first set of convolutions corresponding to the two inputs and process its output to obtain a measure of similarity. 
Our models, on the other hand, are significantly deeper. Besides, Normalized X-Corr stands out by retaining the matching outcome corresponding to every candidate in the search space of a pixel rather than choosing only the maximum match. Ahmed et al. too propose a very promising architecture for Person Re-Identification [2]. Our models are inspired by their work and incorporate some of their features, but we differ substantially from their approach by performing an inexact matching over a wider search space after two stages of convolution. Further, Normalized X-Corr has fewer parameters than Ahmed et al.'s model [2].\nFinally, our Fused model is a one-of-a-kind deep learning architecture for Person Re-Identification, because a combination (fusion) of multiple deep frameworks has hitherto remained unexplored for this task.\n\n3 Proposed Approach\n\n3.1 Our Architecture\n\nIn this work, we propose two architectures for Person Re-Identification. Both of our architectures are a type of “Siamese” CNN model that takes as input two images for matching and outputs the likelihood that the two images contain the same person.\n\n3.1.1 Normalized X-Corr\n\nThe following are the principal components of the Normalized X-Corr model, as shown in Fig. 2.\n\nFigure 2: Architecture of the Normalized X-Corr Model.\n\nFigure 3: Architecture of the Fused Network Model.\n\nTied Convolution Layers: Convolutional features have been shown to be effective representations of images [1, 2]. In order to measure similarity between the input images, the applied transformations must be similar. Therefore, along the lines of Ahmed et al., we perform two stages of convolutions and pooling on both the input images by passing them through two input pipelines of a “Siamese” Network that share weights [2]. 
The first convolution layer takes as input images of size 60×160×3 and applies 20 learned filters of size 5×5×3, while the second applies 25 learned filters of size 5×5×20. Both convolutions are followed by pooling layers, which reduce the output dimension by a factor of 2, and ReLU (Rectified Linear Unit) clipping. This gives us 25 maps of dimension 12×37 as output from each branch, which are fed to the next layer.\nNormalized Correlation Layer: This is the first layer that captures the similarity of the two input images; subsequent layers build on the output of this layer to finally arrive at a decision as to whether the two images are of the same person or not. Different from [2, 3], we incorporate both inexact matching and a wider search. Given two corresponding input feature maps X and Y, we compute the normalized correlation as follows. We start with every pixel of X located at (x, y), where x is along the width and y along the height (denoted as X(x, y)). We then create two matrices. The first is a 5×5 matrix representing the 5×5 neighborhood of X(x, y), while the second is the corresponding 5×5 neighborhood of Y centered at (a, b), where 1 ≤ a ≤ 12 and y − 2 ≤ b ≤ y + 2. Now, markedly different from Ahmed et al. [2], we perform inexact matching over a wider search space by computing a Normalized Correlation between the two patch matrices. Given two matrices, E and F, whose elements are arranged as two N-dimensional vectors, the Normalized Correlation is given by:\n\nnormxcorr(E, F) = Σ_{i=1}^{N} (E_i − μ_E)(F_i − μ_F) / ((N − 1)·σ_E·σ_F),\n\nwhere μ_E, μ_F denote the means of the elements of the 2 matrices E and F respectively, while σ_E, σ_F denote their respective standard deviations (a small ε = 0.01 is added to σ_E and σ_F to avoid division by 0). 
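As an illustration, the normalized correlation between two patches can be computed with a few lines of NumPy. This is our own sketch; the paper does not state whether sample or population standard deviations are used, so ddof=1 is an assumption here:

```python
import numpy as np

def normxcorr(E, F, eps=0.01):
    """Normalized correlation between two equally sized patches, per the
    formula above; eps is added to the standard deviations to avoid
    division by zero. Sample statistics (ddof=1) are assumed."""
    E = np.asarray(E, dtype=float).ravel()
    F = np.asarray(F, dtype=float).ravel()
    N = E.size
    num = np.sum((E - E.mean()) * (F - F.mean()))
    return num / ((N - 1) * (E.std(ddof=1) + eps) * (F.std(ddof=1) + eps))

rng = np.random.default_rng(0)
patch = rng.random((5, 5))
# A patch shifted by a constant intensity scores exactly the same as an
# identical copy: mean subtraction makes the measure invariant to offsets.
same = normxcorr(patch, patch + 0.3)
different = normxcorr(patch, rng.random((5, 5)))
print(same, different)
```

This also illustrates the illumination-invariance argument made below: two patches differing only in absolute intensity receive the maximal score.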
Interestingly, Normalized Correlation being symmetric, we need to model the computation in only one direction, thereby cutting down the number of parameters in subsequent layers. For every pair of 5×5 matrices corresponding to a given pixel in image X, we arrange the normalized correlation values in different feature maps. These feature maps preserve the spatial ordering of pixels in X and are also 12×37 each, but their pixel values represent normalized correlation. This gives us 60 feature maps of dimension 12×37 each. Now similarly, we perform the same operation for all 25 pairs of maps that are input to the Normalized Correlation layer, to obtain an output of 1500 maps, each of dimension 12×37. One subtle but important difference between our approach and that of Li et al. [3] is that we preserve every correlation output corresponding to the search space of a pixel X(x, y), while they only keep the maximum response. 
We then pass this set of feature maps through a ReLU to discard probable noisy matches.\nThe mean subtraction and standard deviation normalization step incorporates illumination invariance, a step unaccounted for in Li et al. [3]. Thus, two patches which differ only in absolute intensity values but are similar in their intensity variation pattern would be treated as similar by our models. The wider search space, compared to Ahmed et al. [2], gives invariance to large viewpoint variation. Further, performing inexact matching (a correlation measure) over a wider search space gives us robustness to partial occlusions. Due to partial occlusions, a part P of a person/object visible in one view may not be visible in others. Using a wider search space, our model looks for a part which is similar to the missing P within a wider neighborhood of P's original location. This is justified since adjacent regions of objects in an image typically have regularity/similarity in appearance, e.g. the bottom and upper halves of the torso. Now, since we are comparing two different parts due to the occlusion of P, we need to perform flexible matching. Thus, inexact matching is used.\nCross Patch Feature Aggregation Layers: The Normalized Correlation layer incorporates information from the local neighborhood of a pixel. We now seek to incorporate greater context information, to obtain a summarization effect. To do so, we perform 2 successive layers of convolution (with a ReLU-filtered output) followed by pooling (by a factor of 2) of the output feature maps from the previous layer. We use 1×1×1500 convolution kernels for the first layer and 3×3×25 convolution kernels for the second convolution layer. Finally, we get 25 maps of dimension 5×17 each.\nFully Connected Layers: The fully connected layers collate information from pixels that are very far apart. 
The feature map outputs obtained from the previous layer are reshaped into one long 2125-dimensional vector. This vector is fed as input to a 500-node fully connected layer, which connects to another fully connected layer containing 2 softmax units. The first unit outputs the probability that the two images are of the same person and the latter, the probability that they are different.\nOne key advantage of the Normalized X-Corr model is that it has about half the number of parameters (about 1.121 million) of the model proposed by Ahmed et al. [2] (refer to the supplementary section for more details).\n\n3.1.2 Fused Model\n\nWhile the Normalized X-Corr model incorporates inexact matching over a wider search space to handle important challenges such as illumination variations, partial occlusions, and wide viewpoint changes, it also suffers from occasional false matches. Upon investigation, we found that these false matches tended to recur, especially when the background of the false matches had an appearance similar to the person being matched (see supplementary). For such cases, exact matching, such as taking a difference and constraining the search window, might be beneficial. We therefore fuse the model proposed by Ahmed et al. [2] with Normalized X-Corr to obtain a Fused model, in anticipation that it incorporates the benefits of both models. Figure 3 shows a representative diagram. We keep the tied convolution layers unchanged as before, then fork off two separate pipelines: one for Normalized X-Corr and the other for Ahmed et al.'s model [2]. The two separate pipelines output a 2125-dimensional vector each, and then they are fused in a 1000-node fully connected layer. The outputs from the fully connected layer are then fed into a 2-unit softmax layer as before.\n\n3.2 Training Algorithm\n\nAll the proposed architectures are trained using the Stochastic Gradient Descent (SGD) algorithm, as in Ahmed et al. [2]. 
The gradient computation is fairly simple except for the Normalized Correlation layer. Given two matrices, E (from the first branch of the Siamese network) and F (from the second branch of the Siamese network), each represented by an N-dimensional vector, the gradient pushed from the Normalized Correlation layer back to the convolution layers on the top branch is given by:\n\n∂normxcorr(E, F)/∂E_i = (1 / ((N − 1)·σ_E)) · ((F_i − μ_F)/σ_F − normxcorr(E, F)·(E_i − μ_E)/σ_E),\n\nwhere E_i is the ith element of the vector representing E and other symbols have their usual meaning. Similar notation is used for the subnetwork at the bottom. The full derivation is available in the supplementary section.\n\n4 Experiments\n\nTable 1: Performance of different algorithms at ranks 1, 10, and 20 on CUHK03 Labeled (left) and CUHK03 Detected (right) Datasets.\n\nCUHK03 Labeled (Method: r = 1 / r = 10 / r = 20)\nFused Model (ours): 72.43 / 95.51 / 98.40\nNorm X-Corr (ours): 64.73 / 92.77 / 96.78\nEnsembles [21]: 62.1 / 92.30 / 97.20\nLOMO+MLAPG [5]: 57.96 / 94.74 / 98.00\nAhmed et al. [2]: 54.74 / 93.88 / 98.10\nLOMO+XQDA [22]: 52.20 / 92.14 / 96.25\nLi et al. [3]: 20.65 / 68.74 / 83.06\nKISSME [18]: 14.17 / 52.57 / 70.03\nLDML [14]: 13.51 / 52.13 / 70.81\neSDC [25]: 8.76 / 38.28 / 53.44\n\nCUHK03 Detected (Method: r = 1 / r = 10 / r = 20)\nFused Model (ours): 72.04 / 96.00 / 98.26\nNorm X-Corr (ours): 67.13 / 94.49 / 97.66\nLOMO+MLAPG [5]: 51.15 / 92.05 / 96.90\nAhmed et al. [2]: 44.96 / 83.47 / 93.15\nLOMO+XQDA [22]: 46.25 / 88.55 / 94.25\nLi et al. [3]: 19.89 / 64.79 / 81.14\nKISSME [18]: 11.70 / 48.08 / 64.86\nLDML [14]: 10.92 / 47.01 / 65.00\neSDC [25]: 7.68 / 33.38 / 50.58\n\n4.1 Datasets, Evaluation Protocol, Baseline Methods\n\nWe conducted experiments on the large CUHK03 dataset [3], the mid-sized CUHK01 dataset [23], and the small QMUL GRID dataset [27]. The datasets are divided into training and test sets for our experiments. 
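The normalized-correlation gradient stated in Section 3.2 can be sanity-checked against finite differences. The sketch below is our own, uses sample standard deviations (ddof=1), and omits the small ε of the forward pass for clarity; none of the names are from the paper's code:

```python
import numpy as np

def normxcorr(E, F):
    # Forward pass of the normalized correlation (eps omitted for clarity).
    N = E.size
    return np.sum((E - E.mean()) * (F - F.mean())) / (
        (N - 1) * E.std(ddof=1) * F.std(ddof=1))

def grad_E(E, F):
    """Analytic gradient of normxcorr w.r.t. each element of E,
    matching the expression in Section 3.2."""
    N = E.size
    return ((F - F.mean()) / F.std(ddof=1)
            - normxcorr(E, F) * (E - E.mean()) / E.std(ddof=1)
            ) / ((N - 1) * E.std(ddof=1))

rng = np.random.default_rng(1)
E, F = rng.random(25), rng.random(25)

# Central finite differences should agree with the analytic expression.
num = np.zeros_like(E)
h = 1e-6
for i in range(E.size):
    Ep, Em = E.copy(), E.copy()
    Ep[i] += h
    Em[i] -= h
    num[i] = (normxcorr(Ep, F) - normxcorr(Em, F)) / (2 * h)

print(np.max(np.abs(num - grad_E(E, F))))  # small, e.g. < 1e-6
```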
The goal of every algorithm is to rank images in the gallery image bank of the test set by their similarity to a probe image (which is also from the test set). To do so, they can exploit the training set, consisting of matched and unmatched image pairs. An oracle would always rank the ground truth match (from the gallery) in the first position. All our experiments are conducted in the single-shot setting, i.e. there is exactly one image of every person in the gallery image bank, and the results averaged over 10 test trials are reported using tables and Cumulative Matching Characteristics (CMC) curves (see supplementary). For all our experiments, we use a momentum of 0.9, a starting learning rate of 0.05, a learning rate decay of 1 × 10^−4, and a weight decay of 5 × 10^−4. The implementation was done on a machine with NVIDIA Titan GPUs; the code was implemented using Torch and is available online 1. We also conducted an ablation study to further analyze the contribution of the individual components of our model.\nCUHK03 Dataset: The CUHK03 dataset is a large collection of 13,164 images of 1360 people captured from 6 different surveillance cameras, with each person observed by 2 cameras with disjoint views [3]. The dataset comes with manually and algorithmically labeled pedestrian bounding boxes. In this work, we conduct experiments on both these sets. For our experiments, we follow the protocol used by Ahmed et al. [2] and randomly pick a set of 1260 identities for training and 100 for testing. We use 100 identities from the training set for validation. We compare the performance of both Normalized X-Corr and the Fused model with several baselines for both the labeled [2, 3, 5, 14, 18, 21, 22, 25] and detected [2, 3, 5, 14, 18, 22, 25] sets. Of these, the comparison with Ahmed et al. [2] and with Li et al. 
[3] is of special interest to us since these are deep learning approaches as well. For our models, we use mini-batch sizes of 128 and train for about 200,000 iterations.\nCUHK01 Dataset: The CUHK01 dataset is a mid-sized collection of 3,884 images of 971 people, with each person observed by 2 cameras with disjoint views [23]. There are 4 images of every identity. For our experiments, we follow the protocol used by Ahmed et al. [2] and conduct 2 sets of experiments with varying training set sizes. In the first, we randomly pick a set of 871 identities for training and 100 for testing, while in the second, 486 identities are used for testing and the rest for training. We compare the performance of both of our models with several baselines for both 100 test identities [2, 3, 14, 18, 25] and 486 test identities [2, 8, 9, 20, 21]. For our models, we use mini-batch sizes of 128 and train for about 50,000 iterations.\nQMUL GRID Dataset: The QMUL underGround Re-Identification (GRID) dataset is a small and very challenging dataset [27]. It is a collection of only 250 people captured from 2 views. Besides the 2 images of every identity, there are 775 unmatched images, i.e. for these identities only 1 view is available. For our experiments, we follow the protocol used by Liao and Li [5]. We randomly pick a set of 125 identities (who have 2 views each) for training and leave the remaining 125 for testing. Additionally, the gallery image bank of the test set is enriched with the 775 unmatched images. This makes the ranking task even more challenging. We compare the performance of both of our models with several baselines [11, 12, 15, 19, 22, 24]. 
For our models, we use mini-batch sizes of 128 and train for about 20,000 iterations.\n\n1 https://github.com/InnovArul/personreid_normxcorr\n\nTable 2: Performance of different algorithms at ranks 1, 10, and 20 on CUHK01 100 Test Ids (left) and 486 Test Ids (right) Datasets\n\nCUHK01, 100 test IDs (Method: r = 1 / r = 10 / r = 20)\nFused Model (ours): 81.23 / 97.39 / 98.60\nNorm X-Corr (ours): 77.43 / 96.67 / 98.40\nAhmed et al. [2]: 65.00 / 93.12 / 97.20\nLi et al. [3]: 27.87 / 73.46 / 86.31\nKISSME [18]: 29.40 / 72.43 / 86.07\nLDML [14]: 26.45 / 72.04 / 84.69\neSDC [25]: 22.84 / 57.67 / 69.84\n\nCUHK01, 486 test IDs (Method: r = 1 / r = 10 / r = 20)\nFused Model (ours): 65.04 / 89.76 / 94.49\nNorm X-Corr (ours): 60.17 / 86.26 / 91.47\nCPDL [8]: 59.5 / 89.70 / 93.10\nEnsembles [21]: 51.9 / 83.00 / 89.40\nAhmed et al. [2]: 47.50 / 80.00 / 87.44\nMirror-KFMA [20]: 40.40 / 75.3 / 84.10\nMid-Level Filters [9]: 34.30 / 65.00 / 74.90\n\n4.2 Training Strategies for the Neural Network\n\nThe large number of parameters of a deep neural network necessitates special training strategies [2]. In this work, we adopt 3 main strategies to train our model.\nData Augmentation: For almost all the datasets, the number of negative pairs far outnumbers the number of positive pairs in the training set. This poses a serious challenge to deep neural nets, which can overfit and get biased in the process. Further, the positive samples may not have all the variations likely to be encountered in a real scenario. We therefore hallucinate positive pairs and enrich the training corpus, along the lines of Ahmed et al. [2]. For every image in the training set of size W×H, we sample 2 images for CUHK03 (5 images for CUHK01 & QMUL) around the original image center and apply 2D translations chosen from a uniform random distribution in the range of [−0.05W, 0.05W] × [−0.05H, 0.05H]. 
We also augment the data with images reflected on a vertical mirror.\nFine-Tuning: For small datasets such as QMUL, training parameter-intensive models such as deep neural networks can be a significant challenge. One way to mitigate this issue is to fine-tune the model while training. We start with a model pre-trained on a large dataset, such as CUHK01 with 871 training identities, rather than an untrained model, and then refine this pre-trained model by training on the small dataset, QMUL in our case. During fine-tuning, we use a learning rate of 0.001.\nOthers: Training deep neural networks is time-consuming. Therefore, to speed up training, we implemented our code such that it spawns threads across multiple GPUs.\n\n5 Results and Discussion\n\nCUHK03 Dataset: Table 1 summarizes the results of the experiments on the CUHK03 Labeled dataset. Our Fused model outperforms the existing state-of-the-art models by a wide margin of about 10% (about 72% vs. 62%) on rank-1 accuracy, while the Normalized X-Corr model gives a 3% gain. This serves as a promising response to our key research endeavor for an effective deep learning model for person re-identification. Further, both models are significantly better than Ahmed et al.'s model [2]. We surmise that this is because our models are more adept at handling variations in illumination, partial occlusion and viewpoint change.\nInterestingly, we also note that the existing best performing system is a non-deep approach. This shows that designing an effective deep learning architecture is a fairly non-trivial task. Our models' performance vis-a-vis non-deep methods once again underscores the benefits of learning representations from the data rather than using hand-engineered ones. 
A visualization of some filter responses of our model, some of the result plots, and some ranked matching results may be found in the supplementary material.\nTable 1 also presents the results on the CUHK03 Detected dataset. Here too, we see the superior performance of our models over the existing state-of-the-art baselines. Interestingly, here our models take a wider lead over the existing baselines (about 21%), and their performance on this set rivals that on the Labeled dataset. We hypothesize that incorporating a wider search space makes our models more robust to the challenges posed by images in which the person is not centered, such as those in the CUHK03 Detected dataset.\nCUHK01 Dataset: Table 2 summarizes the results of the experiments on the CUHK01 dataset with 100 and 486 test identities. For the 486 test identity setting, our models were pre-trained on the training set of the larger CUHK03 Labeled dataset and then fine-tuned on the CUHK01-486 training set, owing to the paucity of training data. As the tables show, our models give us a gain of up to 16% over the existing state-of-the-art on the rank-1 accuracies.\nQMUL GRID Dataset: QMUL GRID is a challenging dataset for person re-identification due to its small size and the additional 775 unmatched gallery images in the test set. 
This is evident from the low performances of existing state-of-the-art algorithms.\n\nTable 3: Performance of different algorithms at ranks 1, 5, 10, and 20 on the QMUL GRID Dataset\n\nMethod: r = 1 / r = 5 / r = 10 / r = 20 / Deep Learning Model\nFused Model (ours): 19.20 / 38.40 / 53.6 / 66.4 / Yes\nNorm X-Corr (ours): 16.00 / 32.00 / 40.00 / 55.2 / Yes\nKEPLER [24]: 18.40 / 39.12 / 50.24 / 61.44 / No\nLOMO+XQDA [22]: 16.56 / 33.84 / 41.84 / 52.40 / No\nPolyMap [11]: 16.30 / 35.80 / 46.00 / 57.60 / No\nMtMCML [19]: 14.08 / 34.64 / 45.84 / 59.84 / No\nMRank-RankSVM [15]: 12.24 / 27.84 / 36.32 / 46.56 / No\nMRank-PRDC [15]: 11.12 / 26.08 / 35.76 / 46.56 / No\nLCRML [12]: 10.68 / 25.76 / 35.04 / 46.48 / No\nXQDA [22]: 10.48 / 28.08 / 38.64 / 52.56 / No\n\nIn order to train our models on this small dataset, we start with a model trained on the CUHK01 dataset with 100 test identities, and then fine-tune the models on the QMUL GRID training set. Table 3 summarizes the results of the experiments on the QMUL GRID dataset. Here too, our Fused model performs the best. Although our gain in rank-1 accuracy is a modest 1%, we believe this is significant for a challenging dataset like QMUL.\nThe ablation study across multiple datasets reveals that a wider search and inexact matching each buy us at least 6% individually, in terms of performance. The supplementary presents these results in more detail and also compares the number of parameters across different models. Multi-GPU training, on the other hand, gives us a 3x boost to training speed.\n\n6 Conclusions and Future Work\n\nIn this work, we address the central research question of proposing simple yet effective deep learning models for Person Re-Identification by proposing two new models. Our models are capable of handling the key challenges of illumination variations, partial occlusions and viewpoint change by incorporating inexact matching over a wider search space. 
Additionally, the proposed Normalized X-Corr model benefits from having fewer parameters than the state-of-the-art deep learning model. The Fused model, on the other hand, allows us to cut down on the false matches that a wide matching search space can introduce, yielding superior performance.
For future work, we intend to apply the proposed Siamese architectures to other matching tasks in Vision, such as content-based image retrieval. We also intend to explore the effect of incorporating more feature maps on performance.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[2] Ejaz Ahmed, Michael Jones, and Tim K. Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.

[3] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 152–159, 2014.

[4] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[5] Shengcai Liao and Stan Z. Li. Efficient PSD constrained asymmetric metric learning for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 3685–3693, 2015.

[6] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Person re-identification by salience matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 2528–2535, 2013.

[7] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani.
Person re-identification by symmetry-driven accumulation of local features. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2360–2367. IEEE, 2010.

[8] Sheng Li, Ming Shao, and Yun Fu. Cross-view projective dictionary learning for person re-identification. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 2155–2161. AAAI Press, 2015.

[9] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Learning mid-level filters for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 144–151, 2014.

[10] Bryan Prosser, Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Person re-identification by support vector ranking. In BMVC, volume 2, page 6, 2010.

[11] Dapeng Chen, Zejian Yuan, Gang Hua, Nanning Zheng, and Jingdong Wang. Similarity learning on an explicit polynomial kernel feature map for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1565–1573, 2015.

[12] Jiaxin Chen, Zhaoxiang Zhang, and Yunhong Wang. Relevance metric learning for person re-identification by exploiting global similarities. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 1657–1662. IEEE, 2014.

[13] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, pages 209–216. ACM, 2007.

[14] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Is that you? Metric learning approaches for face identification. In 2009 IEEE 12th International Conference on Computer Vision, pages 498–505. IEEE, 2009.

[15] Chen Change Loy, Chunxiao Liu, and Shaogang Gong. Person re-identification by manifold ranking.
In 2013 IEEE International Conference on Image Processing, pages 3567–3571. IEEE, 2013.

[16] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Reidentification by relative distance comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(3):653–668, 2013.

[17] Martin Hirzer, Peter M. Roth, and Horst Bischof. Person re-identification by efficient impostor-based metric learning. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pages 203–208. IEEE, 2012.

[18] Martin Köstinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2288–2295. IEEE, 2012.

[19] Lianyang Ma, Xiaokang Yang, and Dacheng Tao. Person re-identification over camera networks using multi-task distance metric learning. IEEE Transactions on Image Processing, 23(8):3656–3670, 2014.

[20] Ying-Cong Chen, Wei-Shi Zheng, and Jianhuang Lai. Mirror representation for modeling view-specific transform in person re-identification. In Proc. IJCAI, pages 3402–3408, 2015.

[21] Sakrapee Paisitkriangkrai, Chunhua Shen, and Anton van den Hengel. Learning to rank in person re-identification with metric ensembles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1846–1855, 2015.

[22] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2197–2206, 2015.

[23] Wei Li, Rui Zhao, and Xiaogang Wang. Human reidentification with transferred metric learning. In Asian Conference on Computer Vision, pages 31–44.
Springer, 2012.

[24] Niki Martinel, Christian Micheloni, and Gian Luca Foresti. Kernelized saliency-based person re-identification through multiple metric learning. IEEE Transactions on Image Processing, 24(12):5645–5658, 2015.

[25] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Unsupervised salience learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3586–3593, 2013.

[26] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li. Deep metric learning for person re-identification. In ICPR, pages 34–39, 2014.

[27] Chen Change Loy, Tao Xiang, and Shaogang Gong. Multi-camera activity correlation analysis. In Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, pages 1988–1995. IEEE, 2009.