{"title": "Incorporating Side Information by Adaptive Convolution", "book": "Advances in Neural Information Processing Systems", "page_first": 3867, "page_last": 3877, "abstract": "Computer vision tasks often have side information available that is helpful to solve the task. For example, for crowd counting, the camera perspective (e.g., camera angle and height) gives a clue about the appearance and scale of people in the scene. While side information has been shown to be useful for counting systems using traditional hand-crafted features, it has not been fully utilized in counting systems based on deep learning. In order to incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), where the convolution filter weights adapt to the current scene context via the side information. In particular, we model the filter weights as a low-dimensional manifold within the high-dimensional space of filter weights. The filter weights are generated using a learned ``filter manifold'' sub-network, whose input is the side information. With the help of side information and adaptive weights, the ACNN can disentangle the variations related to the side information, and extract discriminative features related to the current context (e.g. camera perspective, noise level, blur kernel parameters). We demonstrate the effectiveness of ACNN incorporating side information on 3 tasks: crowd counting, corrupted digit recognition, and image deblurring. Our experiments show that ACNN improves the performance compared to a plain CNN with a similar number of parameters. 
Since existing crowd counting datasets do not contain ground-truth side information, we collect a new dataset with the ground-truth camera angle and height as the side information.", "full_text": "Incorporating Side Information by Adaptive Convolution\n\nDi Kang, Debarun Dhar, Antoni B. Chan\n\n{dkang5-c, ddhar2-c}@my.cityu.edu.hk, abchan@cityu.edu.hk\n\nDepartment of Computer Science\nCity University of Hong Kong\n\nAbstract\n\nComputer vision tasks often have side information available that is helpful to\nsolve the task. For example, for crowd counting, the camera perspective (e.g.,\ncamera angle and height) gives a clue about the appearance and scale of people\nin the scene. While side information has been shown to be useful for counting\nsystems using traditional hand-crafted features, it has not been fully utilized in\ncounting systems based on deep learning. In order to incorporate the available\nside information, we propose an adaptive convolutional neural network (ACNN),\nwhere the convolution filter weights adapt to the current scene context via the\nside information. In particular, we model the filter weights as a low-dimensional\nmanifold within the high-dimensional space of filter weights. The filter weights are\ngenerated using a learned \u201cfilter manifold\u201d sub-network, whose input is the side\ninformation. With the help of side information and adaptive weights, the ACNN can\ndisentangle the variations related to the side information, and extract discriminative\nfeatures related to the current context (e.g. camera perspective, noise level, blur\nkernel parameters). We demonstrate the effectiveness of ACNN incorporating side\ninformation on 3 tasks: crowd counting, corrupted digit recognition, and image\ndeblurring. Our experiments show that ACNN improves the performance compared\nto a plain CNN with a similar number of parameters. 
Since existing crowd counting\ndatasets do not contain ground-truth side information, we collect a new dataset\nwith the ground-truth camera angle and height as the side information.\n\n1 Introduction\nComputer vision tasks often have side information available that is helpful to solve the task. Here we\ndefine \u201cside information\u201d as auxiliary metadata that is associated with the main input, and that affects\nthe appearance/properties of the main input. For example, the camera angle affects the appearance of\na person in an image (see Fig. 1 top). Even within the same scene, a person\u2019s appearance changes as\nthey move along the ground-plane, due to changes in the relative angles to the camera sensor. Most\ndeep learning methods ignore the side information, since if given enough data, a sufficiently large\ndeep network should be able to learn internal representations that are invariant to the side information.\nIn this paper, we explore how side information can be directly incorporated into deep networks so as\nto improve their effectiveness.\nOur motivating application is crowd counting in images, which is challenging due to complicated\nbackgrounds, severe occlusion, low-resolution images, perspective distortion, and different appearances\ncaused by different camera tilt angles. Recent methods are based on crowd density estimation\n[1], where each pixel in the crowd density map represents the fraction of people in that location, and\nthe crowd count is obtained by integrating over a region in the density map. The current state-of-the-art\nuses convolutional neural networks (CNN) to estimate the density maps [2\u20134]. Previous works\nhave also shown that using side information, e.g., the scene perspective, helps to improve crowd\ncounting accuracy [5, 6]. 
In particular, when extracting hand-crafted features (e.g., edge and texture\nstatistics), [5\u20139] use scene perspective normalization, where a \u201cperspective weight\u201d is applied at each\npixel location during feature extraction, to adjust for the scale of the object at that location.\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 2: The adaptive convolutional layer\nwith filter manifold network (FMN). The\nFMN uses the auxiliary input to generate\nthe filter weights, which are then convolved\nwith the input maps.\n\nFigure 1: (top) changes in people\u2019s appearance due to camera\nangle, and the corresponding changes in a convolution filter;\n(bottom) the filter manifold as a function of the camera angle. Best\nviewed in color.\n\nTo handle\nscale variations, typical CNN-based methods resize the input patch [2] based on the perspective\nweight, or extract features at different scales via multiple columns [3] or a pyramid of input patches\n[4]. However, incorporating other types of side information into the CNN is not as straightforward.\nAs a result, all the difficulties due to various contexts, including different backgrounds, occlusion,\nperspective distortion, and different appearances caused by different camera angles are entangled,\nwhich may introduce an extra burden on the CNNs during training. One simple solution is to add an\nextra image channel where each pixel holds the side information [10], which is equivalent to using\n1st-layer filter bias terms that change with the side information. However, this may not be the most\neffective solution when the side information is a high-level property with a complex relationship with\nthe image appearance (e.g., the camera angle).\nOur solution in this paper is to disentangle the context variations explicitly in the CNN by modifying\nthe filter weights adaptively. 
We propose an adaptive CNN (ACNN) that uses side information\n(e.g., the perspective weight) as an auxiliary input to adapt the CNN to different scene contexts\n(e.g., appearance changes from high/low angle perspectives, and scale changes due to distance).\nSpecifically, we consider the filter weights in each convolutional layer as points on a low-dimensional\nmanifold, which is modeled using a sub-network where the side information is the input and the\nfilter weights are the outputs. The filter manifold is estimated during training, resulting in different\nconvolution filters for each scene context, which disentangles the context variations related to the\nside information. In the ACNN, the convolutional layers focus only on those features most suitable\nfor the current context specified by the side information, as compared to traditional CNNs that use a\nfixed set of filters over all contexts. In other words, the feature extractors are tuned for each context.\nWe test the effectiveness of ACNN at incorporating side information on 3 computer vision applications.\nFirst, we perform crowd counting from images using an ACNN with the camera parameters\n(perspective value, or camera tilt angle and height) as side information. Using the camera parameters\nas side information, ACNN can perform cross-scene counting without a fine-tuning stage. We collect\na new dataset covering a wide range of angles and heights, containing people from different viewpoints.\nSecond, we use ACNN for recognition of digit images that are corrupted with salt-and-pepper\nnoise, where the noise level is the side information. Third, we apply ACNN to image deblurring,\nwhere the blur kernel parameters are the side information. A single ACNN can be trained to deblur\nimages for any setting of the kernel parameters. 
In contrast, using a standard CNN would require\ntraining a separate CNN for each combination of kernel parameters, which is costly if the set of\nparameter combinations is large. In our experiments, we show that ACNN can more effectively use\nthe side information, as compared to traditional CNNs with a similar number of parameters \u2013 moving\nparameters from static layers to adaptive layers yields stronger learning capability and adaptability.\nThe contributions of this paper are three-fold: 1) We propose a method to incorporate the side\ninformation directly into a CNN by using an adaptive convolutional layer whose weights are generated\nvia a filter manifold sub-network with the side information as the input; 2) We test the efficacy of ACNN\non a variety of computer vision applications, including crowd counting, corrupted digit recognition,\nand non-blind image deblurring, and show that ACNN is more effective than traditional CNNs with\na similar number of parameters. 3) We collect a new crowd counting dataset covering a wide range of\nviewpoints and its corresponding side information, i.e. camera tilt angle and camera height.\n2 Related work\n2.1 Adapting neural networks\nThe performance of a CNN is affected if the test set is not from the same data distribution as the\ntraining set [2]. A typical approach to adapting a CNN to new data is to select a pre-trained CNN\nmodel, e.g. AlexNet [11], VGG-net [12], or ResNet [13] trained on ImageNet, and then fine-tune\nthe model weights for the specific task. 
[2] adopts a similar strategy \u2013 train the model on the whole\ndataset and then fine-tune using a subset of image patches that are similar to the test scene.\nAnother approach is to adapt the input data cube so that the extracted features and the subsequent\nclassifier/regressor are better matched. [14] proposes a trainable \u201cSpatial Transformer\u201d unit that\napplies an image transformation to register the input image to a standard form before the convolutional\nlayer. The functional form of the image transformation must be known, and the transformation\nparameters are estimated from the image. Because it operates directly on the image, [14] is limited to\n2D image transformations, which work well for 2D planar surfaces in an image (e.g., text on a flat\nsurface), but cannot handle viewpoint changes of 3D objects (e.g. people). In contrast, our ACNN\nchanges the feature extraction layers based on the current 3D viewpoint, and does not require the\ngeometric transformation to be known.\nMost related to our work are dynamic convolution [15] and dynamic filter networks [16], which use\nthe input image to dynamically generate the filter weights for convolution. However, their purpose\nfor dynamically generating filters is quite different from ours. [15, 16] focus on image prediction\ntasks (e.g., predicting the next frame from the previous frames), and the dynamically-generated filters\nare mainly used to transfer a pixel value in the input image to a new position in the output image\n(e.g., predicting the movement of pixels between frames). These input-specific filters are suitable\nfor low-level tasks, i.e. the input and the output are both in the same space (e.g., images). But\nfor high-level tasks, dramatically changing features with respect to the input is not helpful for the\nend-goal of classification or regression. 
In contrast, our purpose is to include side information into\nsupervised learning (regression and classification), by learning how the discriminative image features\nand corresponding filters change with respect to the side information. Hence, in our ACNN, the filter\nweights are generated from an auxiliary input corresponding to the side information.\nHyperNetworks [17] use relaxed weight-sharing between layers/blocks, where layer weights are\ngenerated from a low-dimensional linear manifold. This can improve the expressiveness of RNNs, by\nchanging the weights over time, or reduce the number of learnable parameters in CNNs, by sharing\nweight bases across layers. Specifically, for CNNs, the weight manifold of the HyperNetwork is\nshared across layers, and the inputs/embedding vectors of the HyperNetwork are independently\nlearned for every layer during training. The operation of ACNNs is orthogonal to HyperNetworks \u2013 in\nACNN, the weight manifold is trained independently for each layer, and the input/side information is\nshared across layers. In addition, our goal is to incorporate the available side information to improve\nthe performance of the CNN models, which is not considered in [17].\nFinally, one advantage of [14\u201317] is that no extra information or label is needed. However, this also\nmeans they cannot effectively utilize the available side information, which is common in various\ncomputer vision tasks and has been shown to be helpful for traditional hand-crafted features [5].\n2.2 Crowd density maps\n[1] proposes the concept of an object density map whose integral over any region equals the number\nof objects in that region. The spatial distribution of the objects is preserved in the density map, which\nalso makes it useful for detection [18, 19] and tracking [20]. Most of the recent state-of-the-art object\ncounting algorithms adopt the density estimation approach [2\u20134, 8, 21]. 
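As a concrete illustration of the density-map idea, the following is a minimal NumPy sketch under the common construction of placing one normalized Gaussian per annotated person (an assumption for illustration, not the exact ground-truth generation of any particular paper): the map's sum over a region then recovers the count in that region.

```python
import numpy as np

def density_map(points, shape, sigma=2.0):
    """Build a crowd density map: one normalized Gaussian per annotated point,
    so the whole map sums (integrates) to the number of people."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    dmap = np.zeros(shape, dtype=np.float64)
    for (py, px) in points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()  # normalize so each person contributes exactly 1
    return dmap

# Three annotated head positions; the map sums to 3.
dmap = density_map([(10, 12), (30, 40), (31, 41)], (64, 64))
count = dmap.sum()
```

Because each Gaussian is renormalized after truncation to the image grid, the count is exact even for annotations near the image border.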
CNN-based methods [2\u20134]\nshow strong cross-scene prediction capability, due to the learning capacity of CNNs. Specifically,\n[3] uses a multi-column CNN with different receptive field sizes in order to encourage different\ncolumns to capture features at different scales (without input scaling or explicit supervision), while\n[4] uses a pyramid of input patches, each sent to a separate sub-network, to consider multiple scales.\n[2] introduces an extra fine-tuning stage so that the network can be better adapted to a new scene.\nIn contrast to [2, 3], we propose to use the existing side information (e.g. perspective weight) as an\ninput to adapt the convolutional layers to different scenes. With the adaptive convolutional layers,\nonly the discriminative features suitable for the current context are extracted. Our experiments show\nthat moving parameters from static layers to adaptive layers yields stronger learning capability.\n\n2.3 Image deconvolution\nExisting works [22\u201324] demonstrate that CNNs can be used for image deconvolution and restoration.\nWith non-blind deblurring, the blur kernel is known and the goal is to recover the original image.\n[23] concatenates a deep deconvolution CNN and a denoising CNN to perform deblurring and artifact\nremoval. However, [23] requires a separate network to be trained for each blur kernel family and\nkernel parameter. [24] trains a multi-layer perceptron to denoise images corrupted by additive white\nGaussian (AWG) noise. They incorporate the side information (AWG standard deviation) by simply\nappending it to the vectorized image patch input. In this paper, we use the kernel parameter as an\nauxiliary input, and train a single ACNN for a blur kernel family (for all its parameter values), rather\nthan for each parameter separately. 
During prediction, the \u201cfilter-manifold network\u201d uses the auxiliary\ninput to generate the appropriate deblurring filters, without the need for additional training.\n\n3 Adaptive CNN\nIn this section, we introduce the adaptive convolutional layer and the ACNN.\n\n3.1 Adaptive convolutional layer\nConsider a crowd image dataset containing different viewpoints of people, and suppose we train a separate\nCNN to predict the density map for each viewpoint. For two similar viewpoints, we expect that the\ntwo trained CNNs have similar convolution filter weights, as a person\u2019s appearance varies gradually\nwith the viewpoint (see Fig. 1 top). Hence, as the viewpoint changes smoothly, the convolution\nfilter weights also change smoothly, and thus sweep a low-dimensional manifold within the\nhigh-dimensional space of filter weights (see Fig. 1 bottom).\nFollowing this idea, we use an adaptive convolutional layer, where the convolution filter weights\nare the outputs of a separate \u201cfilter-manifold network\u201d (FMN, see Fig. 2). In the FMN, the side\ninformation is an auxiliary input that feeds into fully-connected layers with increasing dimension\n(similar to the decoder stage of an auto-encoder), with the final layer outputting the convolution filter\nweights. The FMN output is reshaped into a 4D tensor of convolution filter weights (and bias), and\nconvolved with the input image. Note that in contrast to the traditional convolutional layer, whose\nfilter weights are fixed during the inference stage, the filter weights of an adaptive convolutional layer\nchange with respect to the auxiliary input. 
Formally, the adaptive convolutional layer is given by\nh = f (x \u2217 g(z; w)), where z is the auxiliary input, g(\u00b7; w) is the filter manifold network with tunable\nweights w, x is the input image, and f (\u00b7) is the activation function.1\nTraining the adaptive convolutional layer involves updating the FMN weights w, thus learning the\nfilter manifold as a function of the auxiliary input. During inference, the FMN interpolates along the\nfilter manifold using the auxiliary input, thus adapting the filter weights of the convolutional layer to\nthe current context. Hence adaptation does not require fine-tuning or transfer learning.\n\n3.2 Adaptive CNN for crowd counting\nWe next introduce the ACNN for crowd counting. Density map estimation is not as high-level a\ntask as recognition. Since the upper convolutional layers extract more abstract features, which are\nnot very helpful for this task according to both traditional [1, 5] and deep methods [2, 3], we will not use many\nconvolutional layers. Fig. 3 shows our ACNN for density map estimation using two convolutional\nstages. The input is an image patch, while the output is the crowd density at the center of the patch.\nAll the convolutional layers use the ReLU activation, and each convolutional layer is followed by a\nlocal response normalization layer [11] and a max pooling layer. The auxiliary input for the FMN is\nthe perspective value for the image patch in the scene, or the camera tilt angle and camera height.\nFor the fully-connected stage, we use multi-task learning to improve the training of the feature\nextractors [2, 25\u201327]. 
In particular, the main regression task predicts the crowd density value, while\nan auxiliary classification task predicts the number of people in the image patch.\nThe adaptive convolutional layer has more parameters than a standard convolutional layer with the\nsame number of filters and the same filter spatial size \u2013 the extra parameters are in the layers of the\n\n1To reduce clutter, here we do not show the bias term for the convolution.\n\nLayer | CNN | ACNN\nFMN1 | \u2013 | 34,572 (832)\nconv1 | 1,664 (64) | 0 (32)\nFMN2 | \u2013 | 1,051,372 (25,632)\nconv2 | 102,464 (64) | 0 (32)\nFC1 | 2,654,720 (512) | 1,327,616 (512)\nFC2 | 41,553 (81) | 41,553 (81)\nFC3 | 82 (1) | 82 (1)\nFC4 | 419,985 (81) | 210,033 (81)\nFC5 | 1,312 (15) | 1,312 (15)\ntotal | 3,221,780 | 2,666,540\n\nFigure 3: The architecture of our ACNN with adaptive convolutional layers for crowd density estimation.\n\nTable 1: Comparison of the number of parameters in each layer of the ACNN in Fig. 3 and an equivalent CNN. The number in parentheses is the number of convolution filters, or the number of outputs of the FMN/fully-connected (FC) layer.\n\nFMN. However, since the filters themselves adapt to the scene context, an ACNN can be effective\nwith fewer feature channels (from 64 to 32), and the parameter savings can be moved to the FMN\n(e.g. see Table 1). Hence, if side information is available, a standard CNN can be converted into\nan ACNN with a similar number of parameters, but with better learning capability. We verify this\nproperty in the experiments.\nSince most of the parameters of the FMN are in its last layer, the FMN has O(LF) parameters, where\nF is the number of filter parameters in the convolution layer and L is the size of the last hidden\nlayer of the FMN. Hence, for a large number of channels (e.g., 128 in, 512 out), the FMN will be\nextremely large. 
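To make the mechanism concrete, here is a minimal NumPy sketch of an adaptive convolutional layer h = f(x \u2217 g(z; w)): a small MLP (the FMN) maps the side information z to the weights and bias of a single 5\u00d75 filter, which is then convolved with the input. The layer sizes, tanh hidden activations, and the normalization of z are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Filter manifold network (FMN): a small MLP mapping side information z to
# the weights and bias of one 5x5 convolution filter (25 weights + 1 bias).
# Layer sizes (1 -> 10 -> 40 -> 26) and tanh activations are illustrative.
W1, b1 = rng.normal(size=(10, 1)), np.zeros(10)
W2, b2 = rng.normal(size=(40, 10)), np.zeros(40)
W3, b3 = 0.01 * rng.normal(size=(26, 40)), np.zeros(26)

def fmn(z):
    """g(z; w): generate filter weights from the (normalized) side information z."""
    h = np.tanh(W1 @ np.atleast_1d(z) + b1)
    h = np.tanh(W2 @ h + b2)
    out = W3 @ h + b3
    return out[:25].reshape(5, 5), out[25]

def adaptive_conv(x, z):
    """h = f(x * g(z; w)): 'valid' convolution with z-dependent weights, ReLU f."""
    k, bias = fmn(z)
    H, W = x.shape
    out = np.zeros((H - 4, W - 4))
    for i in range(H - 4):
        for j in range(W - 4):
            out[i, j] = np.sum(x[i:i + 5, j:j + 5] * k) + bias
    return np.maximum(out, 0.0)

x = rng.normal(size=(33, 33))   # a 33x33 input patch, as in the UCSD setup
y = adaptive_conv(x, z=-0.3)    # z: e.g. a normalized camera tilt angle
```

In training, gradients flow through the generated filter into the FMN weights (W1, W2, W3), so it is the manifold g(\u00b7; w) that is learned, not a fixed filter.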
One way to handle more channels is to reduce the number of parameters in the FMN,\nby assuming that sub-blocks in the final weight matrix of the FMN form a manifold, which can be\nmodeled by another FMN (i.e., an FMN-in-FMN). Here, the auxiliary inputs for the sub-block FMNs\nare generated from another network whose input is the original auxiliary input.\n\n3.3 Adaptive CNN for image deconvolution\nOur ACNN for image deconvolution is based on the deconvolution CNN proposed in [23]. The\nACNN uses the blur kernel parameter (e.g., radius of the disk kernel) as the side information, and\nconsists of three adaptive convolutional layers (see Fig. 4). The ACNN uses 12 filter channels in the\nfirst 2 layers, which yields an architecture with a similar number of parameters as the standard CNN\nwith 38 filters in [23]. The ACNN consists of two long 1D adaptive convolutional layers: twelve\n121\u00d71 vertical 1D filters, followed by twelve 1\u00d7121 horizontal 1D filters. The result is passed\nthrough a 1\u00d71 adaptive convolutional layer to fuse all the feature maps. The input is the blurred\nimage and the output target is the original image. We use leaky ReLU activations [28] for the first\ntwo convolutional layers, and sigmoid activation for the last layer to produce a bounded output image.\nBatch normalization layers [29] are used after the convolutional layers.\nDuring prediction, the FMN uses the kernel parameter (auxiliary input) to generate the appropriate\ndeblurring filters, without the need for additional training. 
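For concreteness, a disk (pillbox) blur kernel of radius r, the single-parameter kernel family referred to above, can be generated as follows. This is a hard-edged sketch; any anti-aliasing at the disk boundary is omitted as an assumption.

```python
import numpy as np

def disk_kernel(r):
    """Normalized disk (pillbox) blur kernel of radius r; blurring an image
    means convolving it with this kernel."""
    R = int(np.ceil(r))
    ys, xs = np.mgrid[-R:R + 1, -R:R + 1]
    k = (ys ** 2 + xs ** 2 <= r ** 2).astype(np.float64)
    return k / k.sum()  # normalize so the blur preserves mean intensity

k = disk_kernel(3)  # a 7x7 kernel for radius r = 3
```

A single scalar r thus parameterizes the whole family, which is what makes it a natural auxiliary input for the FMN.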
Hence, the two advantages of using ACNN\nare: 1) only one network is needed for each blur kernel family, which is useful for kernels with too\nmany parameter combinations to enumerate; 2) by interpolating along the filter manifold, ACNN can\nwork on kernel parameters unseen in the training set.\n\n4 Experiments\nTo show their potential, we evaluate ACNNs on three tasks: crowd counting, digit recognition with\nsalt-and-pepper noise, and image deconvolution (deblurring). In order to make fair comparisons,\nwe compare our ACNN with standard CNNs using traditional convolutional layers, but increase the\nnumber of filter channels in the CNN so that they have a similar total number of parameters as the\nACNN. We also test a CNN with side information included as an extra input channel(s) (denoted as\nCNN-X), where the side information is replicated in each pixel of the extra channel, as in [10].\nFor ACNN, each adaptive convolution layer has its own FMN, which is a standard MLP with two\nhidden layers and a linear output layer. The size of the FMN output layer is the same as the number\nof filter parameters in its associated convolution layer, and the size of the last hidden layer (e.g., 40 in\nFig. 
3) was selected so that the ACNN and baseline CNN have roughly equal numbers of parameters.\n\n[Fig. 3 architecture: input image patch (1x33x33) \u2192 conv1 (32x17x17) \u2192 conv2 (32x9x9) \u2192 FC1 (512) \u2192 FC2 (81) \u2192 FC3 (1) output density, plus auxiliary classification task FC4 (81) \u2192 FC5 (15); FMN1 (10)(40) generates filter weights (32x1x5x5)+32 and FMN2 (10)(40) generates filter weights (32x32x5x5)+32 from the auxiliary input: perspective value (1).]\n\nMethod | MAE\nMESA [1] | 1.70\nRegression forest [21] | 1.70\nRR [8] | 1.24\nCNN-patch+RR [2] | 1.70\nCNN (normalized patch) | 1.32\nMCNN [3] | 1.26\nCNN | 1.20\nCNN-X | 1.26\nACNN-v1 | 1.23\nACNN-v2 | 1.14\nACNN-v3 | 0.96\n\nTable 2: Comparison of mean absolute error (MAE) for counting with crowd density estimation methods on the UCSD \u201cmax\u201d split.\n\nMethod | R1 | R2 (unseen) | R3 | Avg.\nCNN | 1.83 | 1.06 | 0.62 | 1.17\nCNN-X | 1.33 | 1.18 | 0.61 | 1.04\nACNN-v1 | 1.47 | 0.95 | 0.59 | 1.00\nACNN-v2 | 1.22 | 0.91 | 0.55 | 0.89\nACNN-v3 | 1.15 | 1.02 | 0.63 | 0.93\n\nTable 3: Comparison of MAE on 3 bar regions on the UCSD \u201cmax\u201d split.\n\nFigure 4: ACNN for image deconvolution. The auxiliary input is the radius r of the disk blurring kernel.\n\nFigure 5: UCSD dataset with 3 bar regions. The range of perspective values is shown in parentheses.\n\n4.1 Crowd counting experiments\n\nFor crowd counting, we use two crowd counting datasets: the popular UCSD crowd counting dataset,\nand our newly collected dataset with camera tilt angle and camera height as side information.\n\n4.1.1 UCSD dataset\nRefer to Fig. 3 for the ACNN architecture used for the UCSD dataset. The image size is 238\u00d7158, and\n33\u00d733 patches are used. 
We test several variations of the ACNN: v1) only the first convolutional layer\nis adaptive, with 64 filters for both of the convolutional layers; v2) only the last convolutional layer is\nadaptive, with 64 filters for the first convolutional layer and 30 filters for the second convolutional\nlayer; v3) all the convolutional layers are adaptive, with 32 filters for all layers, which provides\nmaximum adaptability. The side information (auxiliary input) used for the FMN is the perspective\nvalue. For comparison, we also test a plain CNN and CNN-X with a similar architecture but using\nstandard convolutional layers with 64 filters in each layer, and another plain CNN with the input patch\nsize normalization introduced in [2] (i.e., resizing larger patches for near-camera regions). The\nnumbers of parameters are shown in Table 1. The count predictions in the region-of-interest (ROI)\nare evaluated using the mean absolute error (MAE) between the predicted count and the ground-truth.\nWe first use the widely adopted protocol of the \u201cmax\u201d split, which uses 160 frames (frames 601:5:1400)\nfor training, and the remaining parts (frames 1:600, 1401:2000) for testing. The results are listed in\nTable 2. Our ACNN-v3, using two adaptive convolutional layers, offers maximum adaptability and\nhas the lowest error (0.96 MAE), compared to the equivalent plain CNN and the reference methods.\nWhile CNN-X reduces the error compared to CNN, CNN-X still has a larger error than ACNN. This\ndemonstrates that the FMN of ACNN is better at incorporating the side information. In addition, using\nsimple input patch size normalization does not improve the performance as effectively as ACNN.\nExamples of the learned filter manifolds are shown in Fig. 6. 
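The FMN parameter counts in Table 1 can be reproduced from the fully-connected layer sizes shown in Fig. 3 (1 \u2192 10 \u2192 40 \u2192 output, counting weights and biases):

```python
def mlp_params(sizes):
    """Total number of weights and biases in a fully-connected network
    with the given layer sizes."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

# FMN1 emits 32 filters of 1x5x5 plus 32 biases = 832 output values;
# FMN2 emits 32 filters of 32x5x5 plus 32 biases = 25,632 output values.
fmn1 = mlp_params([1, 10, 40, 32 * 1 * 5 * 5 + 32])   # 34,572, as in Table 1
fmn2 = mlp_params([1, 10, 40, 32 * 32 * 5 * 5 + 32])  # 1,051,372, as in Table 1
```

As the text notes, the last layer dominates: almost all of FMN2's parameters come from the 40 \u00d7 25,632 output weight matrix, which is the O(LF) behavior.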
We also tested using 1 hidden layer in\nthe FMN, and obtained worse errors for each version of ACNN (1.74, 1.15, and 1.20, respectively).\nUsing only one hidden layer limits the ability to model the filter manifold well.\nIn the next experiment we test the effect of the side information within the same scene. The ROI of\nUCSD is further divided into three bar regions of the same height (see Fig. 5). The models are trained\nonly on R1 and R3 from the training set, and tested on all three regions of the test set separately.\nThe results are listed in Table 3. After disentangling the variations due to the perspective value, the\nperformance on R1 is significantly improved because the ACNN uses the context information\nto distinguish it from the other regions. Perspective values within R2 are completely unseen during\ntraining, but our ACNN still gives comparable or slightly better performance than the CNN, which\ndemonstrates that the FMN can smoothly interpolate along the filter manifold.\n\n[Fig. 4 architecture: input image (3x184x184) \u2192 conv1 (12x184x184) \u2192 conv2 (12x184x184) \u2192 output image (3x184x184); FMN1 (4)(8) generates filter weights (12x3x121x1)+12, FMN2 (4)(8) generates filter weights (12x12x1x121)+12, and FMN3 (4)(8) generates filter weights (3x12x1x1), all from the blurring kernel parameter (1).]\n[Fig. 5 bar regions: R1 (6.7-13.2), R2 (13.2-17.7), R3 (17.6-22.1).]\n\nFigure 6: Examples of learned filter manifolds for the 2nd convolutional layer. Each row shows one filter as a function of the auxiliary\ninput (perspective weight, from 6.7 to 21.4), shown at the top. 
Both the amplitude\nand patterns change, which shows the adaptability of the ACNN.\n\nMethod | MAE\nLBP+RR [2, 3] | 23.97\nMCNN [3] | 8.80\nCNN | 8.72\nCNN-X (AH) | 9.05\nCNN-X (AHP) | 8.45\nACNN (AH) | 8.35\nACNN (AHP) | 8.00\n\nTable 4: Counting results on CityUHK-X, the new counting dataset with side information.\n\n[Fig. 7 panels, each showing an image and its predicted density map: -20.4\u25e6, 6.1m: 92.44 (1.57); -29.8\u25e6, 4.9m: 18.22 (2.47); -39.8\u25e6, 6.7m: 28.99 (0.66); -55.2\u25e6, 11.6m: 21.71 (1.24).]\nFigure 7: Examples of the predicted density map by our ACNN on the new CityUHK-X dataset. The extrinsic\nparameters and predicted count (absolute error in parentheses) are shown above the images.\n\n4.1.2 CityUHK-X: new crowd dataset with extrinsic camera parameters\n\nThe new crowd dataset \u201cCityUHK-X\u201d contains 55 scenes (3,191 images in total), covering a camera\ntilt angle range of [-10\u25e6, -65\u25e6] and a height range of [2.2, 16.0] meters. The training set consists of\n43 scenes (2,503 images; 78,592 people), and the test set comprises 12 scenes (688 images; 28,191\npeople). More information and demo images can be found in the supplemental. The resolution\nof the new dataset is 512\u00d7384, and 65\u00d765 patches are used. The ACNN for this dataset contains\nthree convolutional and max-pooling layers, resulting in the same output feature map size after the\nconvolutional stage as in the ACNN for UCSD. The three adaptive convolutional layers use 40, 40,\nand 32 filters of size 5\u00d75 each. The side information (auxiliary inputs) are the camera tilt angle and\ncamera height (denoted as \u201cAH\u201d), or the camera tilt angle, camera height, and perspective value\n(denoted as \u201cAHP\u201d). The baseline plain CNN and CNN-X use 64 filters of size 5\u00d75 for all three\nconvolutional layers.\nResults for ACNN, the plain CNN and CNN-X, and the multi-column CNN (MCNN) [3] are presented\nin Table 4. 
The plain CNN and MCNN [3], which do not use side information, obtain similar results. Using side information with ACNN decreases the MAE compared to the plain CNN and CNN-X, with more side information improving the results further (AHP vs. AH). Fig. 7 presents example results.

4.2 Digit recognition with salt-and-pepper noise

In this experiment, the task is to recognize handwritten digits that are corrupted with different levels of salt-and-pepper noise. The side information is the noise level. We use the MNIST handwritten digits dataset, which contains 60,000 training and 10,000 test examples. We randomly add salt-and-pepper noise (half salt, half pepper) to the MNIST images. Nine noise levels, from 0% to 80% with an interval of 10%, are applied to the original MNIST training set, with the same number of images for each noise level, resulting in a training set of 540,000 samples. Separate validation and test sets, both containing 90,000 samples, are generated from the original MNIST test set.
We test our ACNN with the noise level as the side information, as well as the plain CNN and CNN-X. We consider two architectures: two or four convolutional layers (2-conv or 4-conv) followed by

Architecture     No. Conv. Filters      No. Parameters   Error Rate
CNN 2-conv       32 + 32                113,386          8.66%
CNN-X 2-conv     32 + 32                113,674          8.49% (8.60%)
ACNN 2-conv      32 + 26                105,712          7.55% (7.64%)
CNN 4-conv       32 + 32 + 32 + 32      131,882          3.58%
CNN-X 4-conv     32 + 32 + 32 + 32      132,170          3.57% (3.64%)
ACNN 4-conv      32 + 32 + 32 + 26      124,208          2.92% (2.97%)

Table 5: Digit recognition with salt-and-pepper noise, where the noise level is the side information. The number of filters for each convolutional layer and the total number of parameters are listed. 
In the Error Rate column, the value in parentheses is the error when using the estimated side information rather than the ground-truth.

training set r      method          r=3    r=5    r=7    r=9    r=11   all    seen r  unseen r
—                   blurred image   23.42  21.90  20.96  20.28  19.74  21.26  —       —
{3, 7, 11}          CNN [23]        +0.55  -0.25  +0.49  +0.69  +0.56  +0.41  +0.53   +0.22
{3, 7, 11}          CNN-X           +0.88  -0.70  +1.65  +0.47  +1.86  +0.83  +1.46   -0.12
{3, 7, 11}          ACNN            +0.77  +0.06  +1.17  +0.94  +1.28  +0.84  +1.07   +0.50
{3, 7, 11}          CNN-X (blind)   +0.77  -0.77  +1.23  +0.25  +0.98  +0.49  +0.99   -0.26
{3, 7, 11}          ACNN (blind)    +0.76  -0.04  +0.70  +0.80  +1.13  +0.67  +0.86   +0.38
{3, 5, 7, 9, 11}    CNN [23]        +0.28  +0.45  +0.62  +0.86  +0.59  +0.56  +0.56   —
{3, 5, 7, 9, 11}    CNN-X           +0.99  +1.38  +1.53  +1.60  +1.55  +1.41  +1.41   —
{3, 5, 7, 9, 11}    ACNN            +0.71  +0.92  +1.00  +1.28  +1.22  +1.03  +1.03   —
{3, 5, 7, 9, 11}    CNN-X (blind)   +0.91  +1.06  +0.81  +1.12  +1.24  +1.03  +1.03   —
{3, 5, 7, 9, 11}    ACNN (blind)    +0.66  +0.79  +0.64  +1.12  +1.04  +0.85  +0.85   —

Table 6: PSNRs for image deconvolution experiments. The PSNR of the blurred input image is given in the first row, while the other rows show the change in PSNR relative to that of the blurred input image. "Blind" means the network takes an estimated auxiliary value (disk radius) as the side information.

two fully-connected (FC) layers.2 For ACNN, only the 1st convolutional layer is adaptive. All convolutional layers use 3×3 filters. All networks use the same configuration for the FC layers: one 128-neuron layer and one 10-neuron layer. ReLU activation is used for all layers, except the final output layer, which uses soft-max. Max pooling is used after each convolutional layer in the 2-conv network, and after the 2nd and 4th convolutional layers in the 4-conv network.
The classification error rates are listed in Table 5. 
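The corruption procedure above can be sketched in a few lines of numpy: a given fraction of pixels is overwritten, half with the maximum intensity (salt) and half with the minimum (pepper). The exact sampling details used by the authors may differ; this is a minimal illustration on a stand-in image.

```python
import numpy as np

def add_salt_and_pepper(img, level, rng):
    """Corrupt a `level` fraction of pixels: half set to the max value
    (salt), half to the min value (pepper). Pixel values are in [0, 1]."""
    out = img.copy()
    n_corrupt = int(round(level * img.size))
    idx = rng.choice(img.size, size=n_corrupt, replace=False)
    flat = out.reshape(-1)            # view into `out`
    flat[idx[:n_corrupt // 2]] = 1.0  # salt
    flat[idx[n_corrupt // 2:]] = 0.0  # pepper
    return out

rng = np.random.default_rng(0)
img = np.full((28, 28), 0.5)  # stand-in for a normalized MNIST digit
for level in (0.0, 0.4, 0.8):  # the paper sweeps 0% to 80% in 10% steps
    noisy = add_salt_and_pepper(img, level, rng)
    print(f"level={level:.1f}  corrupted fraction={np.mean(noisy != 0.5):.3f}")
```

Sampling the corrupted positions without replacement makes the realized corruption fraction match the requested noise level exactly (up to rounding), which keeps the side information faithful to the image.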
Generally, adding the side information as an extra input channel (CNN-X) decreases the error, but the benefit diminishes as the baseline performance increases: CNN-X 4-conv decreases the error rate by only 0.01% compared with the CNN. Using ACNN to incorporate the side information improves the performance more significantly. In particular, for ACNN 2-conv, the error rate decreases by 0.94% (11% relative), from 8.49% to 7.55%, while for ACNN 4-conv it decreases by 0.65% (18% relative), from 3.57% to 2.92%.
We also tested the ACNN when the noise level is unknown: the noise level is estimated from the image and then passed to the ACNN. To this end, a 4-layer CNN (2 conv. layers, 1 max-pooling layer, and 2 FC layers) is trained to predict the noise level from the input image. The error rate increases only slightly when using the estimated noise level (e.g., by 0.05% for ACNN 4-conv; see Table 5). More detailed settings of the networks can be found in the supplemental.

4.3 Image deconvolution

In the final experiment, we use ACNN for image deconvolution (deblurring), where the blur kernel parameter is the side information. We test on the Flickr8k [31] dataset, and randomly select 5000 images for training, 1400 images for validation, and another 1600 images for testing. The images were blurred uniformly using a disk kernel, and then corrupted with additive white Gaussian noise (AWG) and JPEG compression as in [23], which is the current state-of-the-art for non-blind deconvolution using deep learning. We train the models on images blurred with different subsets of the kernel radii r ∈ {3, 5, 7, 9, 11}, while the test set consists of images blurred with all r ∈ {3, 5, 7, 9, 11}. The evaluation is based on the peak signal-to-noise ratio (PSNR) between the deconvolved image and the original image, relative to the PSNR of the blurred image.
The results are shown in Table 6 using different sets of radii for the training set. 
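The degradation and the evaluation metric can be sketched as follows: a normalized disk kernel of radius r uniformly blurs the image, and quality is measured by PSNR against the original. This minimal numpy sketch omits the AWG noise and JPEG compression steps of the full protocol, and the toy test image is an assumption for illustration.

```python
import numpy as np

def disk_kernel(r):
    """Uniform disk blur kernel of radius r, normalized to sum to 1."""
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x**2 + y**2 <= r**2).astype(float)
    return k / k.sum()

def blur(img, k):
    """Plain 2D 'same' convolution with edge replication (loop version)."""
    r = k.shape[0] // 2
    padded = np.pad(img, r, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + 2 * r + 1, j:j + 2 * r + 1] * k)
    return out

def psnr(ref, img, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((ref - img) ** 2)
    return 10.0 * np.log10(peak**2 / mse)

# A bright square on a dark background; larger radii blur it more heavily.
img = np.zeros((32, 32))
img[12:20, 12:20] = 1.0
for r in [3, 7, 11]:  # the reduced set of training radii from Table 6
    print(f"r={r:2d}  PSNR={psnr(img, blur(img, disk_kernel(r))):.2f} dB")
```

The PSNR drops monotonically as the disk radius grows, which is why a single blur parameter is such informative side information for the deconvolution network.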
First, when trained on the full training set, ACNN almost doubles the increase in PSNR compared to the CNN (+1.03 dB vs. +0.56 dB). Next, we consider a reduced training set with radii r ∈ {3, 7, 11}, where ACNN again doubles the increase in PSNR (+0.84 dB vs. +0.41 dB). The performance of ACNN on the unseen radii r ∈ {5, 9} is better than that of the CNN, which demonstrates the capability of ACNN to interpolate along the filter manifold for unseen auxiliary inputs. Interestingly, CNN-X has higher PSNR than ACNN on seen radii, but lower PSNR on unseen radii. CNN-X cannot handle interpolation between unseen auxiliary inputs well, which shows the advantage of explicitly modeling the filter manifold.

2 On the clean MNIST dataset, the 2-conv and 4-conv CNN architectures achieve 0.81% and 0.69% error, while the current state-of-the-art is ∼0.23% error [30].

We also test CNN-X and ACNN for blind deconvolution, where we estimate the kernel radius using manually-crafted features and random forest regression (see supplemental). For the blind task, the PSNR drops for CNN-X (0.38 dB on r ∈ {3, 5, 7, 9, 11} and 0.34 dB on r ∈ {3, 7, 11}) are larger than those for ACNN (0.18 and 0.17 dB), which means CNN-X is more sensitive to errors in the auxiliary input.
Example learned filters are presented in Fig. 8, and Fig. 9 presents examples of deblurred images. Deconvolved images using the CNN are overly smoothed, since it treats images blurred by all the kernels uniformly. In contrast, the ACNN result has more details and a higher PSNR.
On this task, CNN-X performs better than ACNN on the seen radii, most likely because the relationship between the side information (disk radius) and the main input (sharp image) is not complicated, and deblurring is a low-level task. Hence, incorporating the side information directly into the filtering calculations (as an extra channel) is a viable solution3. 
In contrast, for the crowd counting and corrupted digit recognition tasks, the relationship between the side information (camera angle/height or noise level) and the main input is less straightforward and not deterministic, and hence the more complex FMN is required to properly adapt the filters. Thus, the adaptive convolutions are not universally superior, and CNN-X can be used in situations where there is a simple relationship between the auxiliary input and the desired filter output.

Figure 8: Two examples of filter manifolds for image deconvolution. The y-axis is the filter weight, and the x-axis is the filter location. The auxiliary input is the disk kernel radius. Both the amplitude and the frequency of the filters can be adapted.

Figure 9: Image deconvolution example: (a) original image (target); (b) blurred input image with a disk radius of 7 (PSNR = 24.34); deconvolved images using (c) CNN [23] (PSNR = 25.30) and (d) our ACNN (PSNR = 26.04).

5 Conclusion

In this paper, we propose an adaptive convolutional neural network (ACNN), which employs the available side information as an auxiliary input to adapt the convolution filter weights. The ACNN can disentangle variations related to the side information, and extract features related to the current context. We apply ACNN to three computer vision applications: crowd counting, using the camera angle/height, optionally with the perspective weight, as side information; corrupted digit recognition, using the noise level as side information; and image deconvolution, using the blur kernel parameter as side information. 
The experiments show that ACNN can better incorporate high-level side information to improve performance, as compared to simple methods such as including the side information as an extra input channel.
The placement of the adaptive convolution layers is important, and should consider the relationship between the image content and the auxiliary input, i.e., how the image content changes with respect to the auxiliary input. For example, for counting, the auxiliary input indicates the amount of perspective distortion, which geometrically transforms people's appearances; thus adapting the 2nd layer is more helpful, since changes in object configuration are reflected in mid-level features. In contrast, salt-and-pepper noise has a low-level (local) effect on the image, and thus adapting the first layer, corresponding to low-level features, is sufficient. How to select the appropriate convolutional layers for adaptation is interesting future work.

3 The extra channel is equivalent to using an adaptive bias term for each filter in the 1st convolutional layer.

Acknowledgments
The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. [T32-101/15-R]), and by a Strategic Research Grant from City University of Hong Kong (Project No. 7004682). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

References
[1] V. Lempitsky and A. Zisserman, “Learning To Count Objects in Images,” in NIPS, 2010.
[2] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene Crowd Counting via Deep Convolutional Neural Networks,” in CVPR, 2015.
[3] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. 
Ma, \u201cSingle-Image Crowd Counting via Multi-Column\n\nConvolutional Neural Network,\u201d in CVPR, 2016. 2, 3, 4, 6, 7\n\n[4] D. Onoro-Rubio and R. J. L\u00f3pez-Sastre, \u201cTowards perspective-free object counting with deep learning,\u201d in\n\nECCV, 2016. 1, 2, 3\n\n[5] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, \u201cPrivacy preserving crowd monitoring: Counting people\n\nwithout people models or tracking,\u201d in CVPR.\n\nIEEE, 2008, pp. 1\u20137. 1, 3, 4\n\n[6] A. B. Chan and N. Vasconcelos, \u201cCounting people with low-level features and bayesian regression,\u201d IEEE\n\nTrans. Image Process., 2012. 1\n\n[7] \u2014\u2014, \u201cBayesian poisson regression for crowd counting,\u201d in ICCV, 2009.\n[8] C. Arteta, V. Lempitsky, J. A. Noble, and A. Zisserman, \u201cInteractive Object Counting,\u201d in ECCV, 2014. 3,\n\n6\n\n[9] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, \u201cMulti-source multi-scale counting in extremely dense\n\n[10] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand, \u201cDeep joint demosaicking and denoising,\u201d ACM\n\ncrowd images,\u201d in CVPR, 2013. 1\n\nTransactions on Graphics (TOG), 2016. 2, 5\n\n[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u201cImagenet classi\ufb01cation with deep convolutional neural\n\n[12] K. Simonyan and A. Zisserman, \u201cVery Deep Convolutional Networks for Large-Scale Image Recognition,\u201d\n\n[13] K. He, X. Zhang, S. Ren, and J. Sun, \u201cDeep Residual Learning for Image Recognition,\u201d in CVPR, 2016. 3\n[14] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, \u201cSpatial transformer networks,\u201d in NIPS,\n\nnetworks,\u201d in NIPS, 2012. 3, 4\n\nin ICLR, 2015. 3\n\n2015, pp. 2017\u20132025. 3\n\nCVPR, 2015. 3\n\n[15] B. Klein, L. Wolf, and Y. Afek, \u201cA Dynamic Convolutional Layer for short range weather prediction,\u201d in\n\n[16] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool, \u201cDynamic \ufb01lter networks,\u201d in NIPS, 2016. 
[17] D. Ha, A. Dai, and Q. V. Le, “HyperNetworks,” in ICLR, 2017.
[18] Z. Ma, L. Yu, and A. B. Chan, “Small Instance Detection by Integer Programming on Object Density Maps,” in CVPR, 2015.
[19] D. Kang, Z. Ma, and A. B. Chan, “Beyond counting: Comparisons of density maps for crowd analysis tasks - counting, detection, and tracking,” arXiv preprint arXiv:1705.10118, 2017.
[20] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert, “Density-aware person detection and tracking in crowds,” in ICCV, 2011.
[21] L. Fiaschi, R. Nair, U. Koethe, and F. A. Hamprecht, “Learning to Count with Regression Forest and Structured Labels,” in ICPR, 2012.
[22] D. Eigen, D. Krishnan, and R. Fergus, “Restoring an image taken through a window covered with dirt or rain,” in ICCV, 2013.
[23] L. Xu, J. S. Ren, C. Liu, and J. Jia, “Deep Convolutional Neural Network for Image Deconvolution,” in NIPS, 2014.
[24] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?” in CVPR, 2012.
[25] S. Li, Z.-Q. Liu, and A. B. Chan, “Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network,” IJCV, 2015.
[26] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial Landmark Detection by Deep Multi-task Learning,” in ECCV, 2014.
[27] Y. Sun, X. Wang, and X. Tang, “Deep Learning Face Representation by Joint Identification-Verification,” in NIPS, 2014.
[28] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier Nonlinearities Improve Neural Network Acoustic Models,” in ICML, 2013.
[29] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in ICML, 2015.
[30] D. Ciresan, U. Meier, and J. 
Schmidhuber, \u201cMulti-column Deep Neural Networks for Image Classi\ufb01cation,\u201d\n\nin CVPR, 2012, pp. 3642\u20133649. 8\n\n[31] M. Hodosh, P. Young, and J. Hockenmaier, \u201cFraming image description as a ranking task: Data, models\n\nand evaluation metrics,\u201d in Journal of Arti\ufb01cial Intelligence Research, 2013. 8\n\n11\n\n\f", "award": [], "sourceid": 2104, "authors": [{"given_name": "Di", "family_name": "Kang", "institution": "City University of Hong Kong"}, {"given_name": "Debarun", "family_name": "Dhar", "institution": "City University of Hong Kong"}, {"given_name": "Antoni", "family_name": "Chan", "institution": "City University of Hong Kong"}]}