{"title": "Nonlocal Neural Networks, Nonlocal Diffusion and Nonlocal Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 496, "page_last": 506, "abstract": "Nonlocal neural networks have been proposed and shown to be effective in several computer vision tasks, where the nonlocal operations can directly capture long-range dependencies in the feature space. In this paper, we study the nature of diffusion and damping effect of nonlocal networks by doing spectrum analysis on the weight matrices of the well-trained networks, and then propose a new formulation of the nonlocal block. The new block not only learns the nonlocal interactions but also has stable dynamics, thus allowing deeper nonlocal structures. Moreover, we interpret our formulation from the general nonlocal modeling perspective, where we make connections between the proposed nonlocal network and other nonlocal models, such as nonlocal diffusion process and Markov jump process.", "full_text": "Nonlocal Neural Networks, Nonlocal Diffusion and\n\nNonlocal Modeling\n\nYunzhe Tao\n\nSchool of Engineering and Applied Science\n\nColumbia University, USA\n\ny.tao@columbia.edu\n\nQiang Du\n\nSchool of Engineering and Applied Science\n\nColumbia University, USA\nqd2125@columbia.edu\n\nQi Sun\n\nBCSRC & USTC\n\nBeijing, China\n\nsunqi@csrc.ac.cn\n\nWei Liu\n\nTencent AI Lab\nShenzhen, China\n\nwl2223@columbia.edu\n\nAbstract\n\nNonlocal neural networks [25] have been proposed and shown to be effective in\nseveral computer vision tasks, where the nonlocal operations can directly capture\nlong-range dependencies in the feature space. In this paper, we study the nature of\ndiffusion and damping effect of nonlocal networks by doing spectrum analysis on\nthe weight matrices of the well-trained networks, and then propose a new formula-\ntion of the nonlocal block. 
The new block not only learns the nonlocal interactions\nbut also has stable dynamics, thus allowing deeper nonlocal structures. Moreover,\nwe interpret our formulation from the general nonlocal modeling perspective, where\nwe make connections between the proposed nonlocal network and other nonlocal\nmodels, such as nonlocal diffusion process and Markov jump process.\n\n1\n\nIntroduction\n\nDeep neural networks, especially convolutional neural networks (CNNs) [15] and recurrent neural\nnetworks (RNNs) [8], have been widely used in a variety of subjects [16]. However, traditional\nneural network blocks aim to learn the feature representations in a local sense. For example, both\nconvolutional and recurrent operations process a local neighborhood (several nearest neighboring\nneurons) in either space or time. Therefore, the long-range dependencies can only be captured\nwhen these operations are applied recursively, while those long-range dependencies are sometimes\nsigni\ufb01cant in practical learning problems, such as image or video classi\ufb01cation, text summarization,\nand \ufb01nancial market analysis [1, 6, 22, 27].\nTo address the above issue, a nonlocal neural network [25] has been proposed recently, which is able\nto improve the performance on a couple of computer vision tasks. In contrast to convolutional or\nrecurrent blocks, nonlocal operations [25] capture long-range dependencies directly by computing\ninteractions between each pair of positions in the feature space. Generally speaking, nonlocality is\nubiquitous in nature, and the nonlocal models and algorithms have been studied in various domains\nof physical, biological and social sciences [2, 5, 7, 23, 24].\nIn this work, we aim to study the nature of nonlocal networks, namely, what the nonlocal blocks\nhave exactly learned through training on a real-world task. 
By doing spectrum analysis on the weight matrices of the stacked nonlocal blocks obtained after training, we can largely quantify the characteristics of the damping effect of nonlocal blocks in a given network.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Based on the nature of diffusion observed from the experiments, we then propose a new nonlocal neural network which, motivated by previous nonlocal modeling works, can be shown to be more generic and stable. Mathematically, we make connections between the nonlocal network and a couple of existing nonlocal models, such as the nonlocal diffusion process and the Markov jump process. The proposed nonlocal network allows a deeper nonlocal structure, while keeping the long-range dependencies learned in a well-preserved feature space.

2 Background

Nonlocal neural networks are usually incorporated into existing cutting-edge model architectures, such as the residual network (ResNet) and its variants, so that one can take full advantage of nonlocal blocks in capturing long-range features. In this section, we briefly review nonlocal networks in the context of traditional image classification tasks and make a comparison among different neural networks. Note that while nonlocal operations are applicable to both space and time variables, we concentrate on the spatial nonlocality in this paper for brevity.

2.1 Nonlocal Networks

Denote by X = [X_1, X_2, ..., X_M] an input sample or the feature representation of a sample, with X_i (i = 1, ..., M) being the feature at position i. A nonlocal block [25] is defined as

Z_i = X_i + (W_Z / C_i(X)) Σ_{∀j} ω(X_i, X_j) g(X_j) .   (1)

Here 1 ≤ i ≤ M, W_Z is a weight matrix, and Z_i is the output signal at position i. Computing Z_i depends on the features at possibly all positions j.
Function g takes any feature X_j as input and returns an embedded representation. The summation is normalized by a factor C_i(X). Moreover, a pairwise function ω computes a scalar between the feature at position i and those at all possible positions j, which usually represents the similarity or affinity. For simplicity, we only consider g in the form of a linear embedding: g(X_j) = W_g X_j, with some weight matrix W_g. As suggested in [25], the choices for the pairwise affinity function ω include, but are not limited to, the (embedded) Gaussian, dot product, and concatenation.

When incorporating nonlocal blocks into a ResNet, a nonlocal network can be written as

Z^{k+1} := Z^k + F(Z^k ; W^k) ,   (2)

where W^k is the parameter set, k = 0, 1, ..., K with K being the total number of network blocks, and Z^k is the output signal at the k-th block with Z^0 = X, the input sample of the ResNet. On one hand, if a nonlocal block is employed, the i-th component of F will be

[F(Z^k ; W^k)]_i = (W^k_Z / C_i(Z^k)) Σ_{∀j} ω(Z^k_i, Z^k_j) (W^k_g Z^k_j) ,   (3)

where 1 ≤ i ≤ M and W^k = {W^k_Z, W^k_g} includes the weight matrices to be learned.
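To make the computation concrete, the nonlocal block of Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the embedded-Gaussian affinity, the per-row softmax normalization (which realizes the factor C_i), and the dimensions (M = 16 positions, 64-dimensional features embedded to 32 channels, mirroring the setting used later in Section 3) are all illustrative choices.

```python
import numpy as np

def nonlocal_block(X, W_theta, W_phi, W_g, W_Z):
    """One nonlocal block, Eq. (1): Z_i = X_i + (W_Z / C_i) * sum_j w(X_i, X_j) g(X_j).

    X: (M, d) features. The affinity is the embedded Gaussian
    w(X_i, X_j) = exp(theta(X_i) . phi(X_j)), normalized row-wise by C_i(X).
    """
    theta, phi = X @ W_theta, X @ W_phi          # pairwise embeddings
    logits = theta @ phi.T                       # (M, M) similarity scores
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # divide by C_i(X): rows sum to 1
    g = X @ W_g                                  # embedded representation g(X_j)
    return X + (w @ g) @ W_Z                     # residual connection

rng = np.random.default_rng(0)
M, d, d_emb = 16, 64, 32
X = rng.standard_normal((M, d))
Ws = [0.1 * rng.standard_normal(s) for s in [(d, d_emb)] * 3 + [(d_emb, d)]]
Z = nonlocal_block(X, *Ws)
print(Z.shape)  # (16, 64)
```

Note that the row-wise softmax is exactly the embedded-Gaussian affinity divided by its normalization factor, which is the instantiation [25] adopts by default.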
In addition, the normalization scalar C_i(Z^k) is defined as

C_i(Z^k) = Σ_{∀j} ω(Z^k_i, Z^k_j) ,  for i = 1, ..., M .   (4)

On the other hand, when the k-th block is a traditional residual block of, e.g., the pre-activation ResNet [11], it contains two stacks of batch normalization (BN) [12], rectified linear unit (ReLU) [21], and weight layers, namely,

F(Z^k ; W^k) = W^k_2 f(W^k_1 f(Z^k)) ,   (5)

where W^k = {W^k_1, W^k_2} contains the weight matrices, and f = ReLU ∘ BN denotes the composition of BN and ReLU.

(a) 1 block   (b) 2 blocks   (c) 3 blocks

Figure 1: Top 32 eigenvalues of the symmetric part of weight matrices with respect to nonlocal blocks. Figures (a), (b) and (c) correspond to the cases of adding 1, 2 and 3 nonlocal blocks to PreResNet-20, respectively. The y-axis represents the values.

2.2 A Brief Comparison with Existing Neural Networks

The most remarkable property of nonlocal networks is that they can learn long-range correlations, since the summation within a nonlocal block is taken over all possible candidates in the current feature space. The pairwise affinity function ω plays an important role in defining nonlocal blocks, as it in some sense determines the level of nonlocality. On one hand, if ω is always positive, then the feature space is totally connected and every two features can interact with each other. On the other hand, if ω is chosen to be a Dirac delta function δ_ij, namely ω(X_i, X_j) = 1 for i = j and 0 otherwise, then the nonlocal block becomes localized and acts like a simplified residual block. Besides, as also mentioned in [25], the nature of nonlocal networks is different from that of other popular neural network architectures, such as CNNs [15] and RNNs [8].
The convolutional or recurrent operation usually takes a weighted sum over only the nearest few neighboring inputs or the latest few time steps, which is still local in contrast with the nonlocal operation.

3 The Damping Effect of Nonlocal Networks

In this section, we first demonstrate the damping effect of nonlocal networks by presenting weight analysis on the well-trained network. More specifically, we train the nonlocal networks for image classification on the CIFAR-10 dataset [14], which consists of 50k training images from 10 classes, and do the spectrum analysis on the weight matrices of nonlocal blocks after training. Based on the analysis, we then propose a more suitable formulation of the nonlocal blocks, followed by some experimental evaluations to demonstrate the effectiveness of the proposed model.

3.1 Spectrum Analysis

We incorporate nonlocal blocks into the 20-layer pre-activation ResNet (PreResNet) [11] as stated in Eq. (2). Since our goal is to illustrate the diffusive nature of nonlocal operations, we add different numbers of nonlocal blocks into a fixed place at an early stage, which differs from the experiments shown in [25], where the nonlocal blocks are added to different places along the ResNet. When employing the nonlocal blocks, we always insert them right after the second residual block of PreResNet-20.
The inputs from CIFAR-10 are images of size 32 × 32, with preprocessing of the per-pixel mean subtracted. In order to adequately train the nonlocal network, the training starts with a learning rate

(a) Epoch 36   (b) Epoch 82   (c) Epoch 125

Figure 2: Top 32 eigenvalues of the symmetrized weight matrices during training when adding 1 nonlocal block.
Figures (a), (b) and (c) correspond to the results at Epoch 36, 82 and 125, respectively.

(a) Training curves   (b) Eigenvalues

Figure 3: (a) The training curves of nonlocal networks with 1, 2, 3 and 4 nonlocal blocks. (b) Top 32 eigenvalues of the symmetric part of weight matrices when adding 4 nonlocal blocks.

of 0.1 that is subsequently divided by 10 at 81 and 122 epochs (around 32k and 48k iterations). A weight decay of 0.0001 and a momentum of 0.9 are also used. We terminate the training at 164 epochs. The model is trained with data augmentation [17] on a single NVIDIA Tesla GPU.

To see what the nonlocal blocks have exactly learned, we extract the weight matrices W_Z and W_g after training. Under the current experimental setting, the weight matrices are of dimensions W_Z ∈ R^{64×32} and W_g ∈ R^{32×64}. Note that if we let W = W_Z W_g, Eq. (3) can be rewritten as

[F(Z^k ; W^k)]_i = (W^k / C_i(Z^k)) Σ_{∀j} ω(Z^k_i, Z^k_j) Z^k_j ,   (6)

where the weight matrix W^k ∈ R^{64×64} has rank at most 32 and hence at most 32 nonzero eigenvalues. We then denote the symmetric part of W by

W̃ = (W + W^T) / 2 ,   (7)

so that the eigenvalues of W̃^k are all real. Note that the effect of W on the decay properties of the network is determined by the associated quadratic form, which is equivalent to that of the symmetric part of W, i.e., W̃. Therefore, the eigenvalues of W̃^k, especially those with the greatest magnitudes (absolute values), describe the characteristics of the weights in nonlocal blocks.

Figure 1 shows the top 32 eigenvalues of the symmetric part of the weight matrices after training when adding 1, 2 and 3 nonlocal blocks to PreResNet-20. One can observe that most of the eigenvalues are negative, especially the first several with the greatest magnitudes. Similar observations can be made along the training process. Figure 2 plots the top 32 eigenvalues of W̃ at three intermediate epochs before convergence for the 1-block nonlocal network case. From Eq. (2) with F being the formulation of nonlocal blocks in Eq. (6), we can see that Z^k tends to vanish regardless of the initial value when imposing small negative coefficients on multiple blocks. Namely, while the nonlocal blocks are trying to capture the long-range correlations, the features tend to be damped out at the same time, which is usually the consequence of diffusion. A more detailed discussion can be found in Section 4.4.

However, the training becomes more difficult when employing more nonlocal blocks. Under the current training strategy, when adding 4 blocks, it does not converge after 164 epochs. Although reducing the learning rate or increasing the number of learning epochs mitigates the convergence issue, the training loss decreases more slowly than in the fewer-block cases. Figure 3(a) shows the learning curves of nonlocal networks with different numbers of nonlocal blocks, where we can see that the training loss for the 4-block network is much larger than the others. Note that for the 4-block case, we have reduced the learning rate by a ratio of 0.1 in order to make the training convergent. This observation implies that the original nonlocal network is not robust with respect to multiple nonlocal blocks, which is also shown in Figure 3(b). For the 4-block case, we obtain many positive eigenvalues of the symmetrized weight matrices after training.
In particular, some of the positive eigenvalues have large magnitudes, which leads to a potential blow-up of the feature vectors and thus makes training more difficult.

3.2 A New Nonlocal Network

Given the damping effect of the nonlocal network and the instability of employing multiple nonlocal blocks, we hereby suggest modifying the formulation in order to have a more well-defined nonlocal operation. Instead of Eq. (3) or Eq. (6), we regard the nonlocal blocks added consecutively to the same place as a nonlocal stage, which consists of several nonlocal sub-blocks. The nonlocal stage takes the input X = [X_i] (i = 1, 2, ..., M) as the feature representation when computing the affinity within the stage. In particular, a nonlocal stage is defined as

Z^{n+1}_i := Z^n_i + (W^n / C_i(X)) Σ_{∀j} ω(X_i, X_j)(Z^n_j − Z^n_i) ,   (8)

for i = 1, 2, ..., M, where W^n is the weight matrix to be learned, Z^0 = X, and n = 1, 2, ..., N with N being the number of nonlocal sub-blocks stacked in a stage. Moreover, C_i(X) is the normalization factor computed by

C_i(X) = Σ_{∀j} ω(X_i, X_j) .   (9)

There are a couple of differences between the two nonlocal formulations in Eq. (1) and Eq. (8). On one hand, in a nonlocal stage in Eq. (8), the affinity ω is pre-computed given the input feature X and stays the same along the propagation within a stage, which reduces the computational cost while ω(X_i, X_j) can still represent the affinity between Z^n_i and Z^n_j to some extent. On the other hand, the residual part of Eq. (8) is no longer a weighted sum of the neighboring features, but the difference between the neighboring signals and the computed signal.
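Under the same illustrative setting as before, the proposed nonlocal stage of Eqs. (8)-(9) can be sketched as follows. The Gaussian-type affinity, the dimensions, and the choice of small multiples of the identity for the weights W^n are placeholder assumptions; the point of the sketch is that the affinity is computed once from X and shared by all sub-blocks:

```python
import numpy as np

def nonlocal_stage(X, W_list):
    """Proposed nonlocal stage, Eq. (8):
    Z^{n+1}_i = Z^n_i + (W^n / C_i(X)) * sum_j w(X_i, X_j) (Z^n_j - Z^n_i)."""
    # Affinity pre-computed from the stage input X and kept fixed, Eq. (9).
    logits = X @ X.T
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)      # normalized: rows sum to 1
    Z = X.copy()
    for W_n in W_list:                     # N stacked sub-blocks share w
        diffusion = w @ Z - Z              # sum_j w_ij (Z_j - Z_i), rows of w sum to 1
        Z = Z + diffusion @ W_n
    return Z

rng = np.random.default_rng(0)
M, d, N = 16, 64, 4
X = rng.standard_normal((M, d))
W_list = [0.1 * np.eye(d) for _ in range(N)]   # small positive weights
Z = nonlocal_stage(X, W_list)
print(Z.shape)  # (16, 64)
```

Since each row of the normalized affinity sums to one, Σ_j w_ij (Z_j − Z_i) reduces to (wZ − Z)_i, and with small positive weights each sub-block contracts the per-feature spread across positions, i.e., it behaves like a diffusion step rather than a blow-up.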
In Section 4, we will study the connections between the proposed nonlocal blocks and several other existing nonlocal models to clarify the rationale of our model.

3.3 Experimental Evaluation

To demonstrate the difference between the two nonlocal networks and the effectiveness of our proposed method, we present an empirical evaluation on CIFAR-10 and CIFAR-100. Following standard practice, we present experiments performed on the training set and evaluated on the test set as validation. We compare the empirical performance of PreResNets incorporating either the original nonlocal blocks [25] or the proposed nonlocal blocks in Eq. (8).
Table 1 presents the validation errors of PreResNet-20 on CIFAR-10 with different numbers of nonlocal blocks that are added either consecutively to the same place or separately to different places, as in the experiments shown in [25]. The best performance for each case is displayed in boldface. Note that all the models are trained by ourselves in order to have a fair comparison. We run each model 5 times and report the median. More experimental results with PreResNet-56 on CIFAR-100 are provided in Table 2. Based on the results shown in the tables, we give some analysis as follows.
First, for the original nonlocal network, since the damping effect cannot be preserved when adding more nonlocal blocks, the training becomes more difficult and thus the validation performs worse.
Table 1: Validation errors of different models based on PreResNet-20 over CIFAR-10.

  Blocks added to the same place:
    Model                  Error (%)
    baseline               8.19
    2-block (original)     7.83
    3-block (original)     8.28
    4-block (original)     15.02
    2-block (proposed)     7.74
    3-block (proposed)     7.62
    4-block (proposed)     7.37
    5-block (proposed)     7.29
    6-block (proposed)     7.55

  Blocks added to different places:
    Model                  Error (%)
    3-block (original)     8.07
    3-block (proposed)     7.33

Table 2: Validation errors of different models based on PreResNet-56 over CIFAR-100.

    Model                  Error (%)
    baseline               26.57
    2-block (original)     26.13
    3-block (original)     26.26
    4-block (original)     34.89
    2-block (proposed)     26.04
    3-block (proposed)     25.57
    4-block (proposed)     25.43
    5-block (proposed)     25.29
    6-block (proposed)     25.49

(a) The 1st block   (b) The 2nd block   (c) The 3rd block   (d) The 4th block

Figure 4: Top 32 eigenvalues of the symmetric part of weight matrices after training when adding 4 proposed nonlocal blocks to PreResNet-20.

In contrast, the proposed nonlocal network is more robust to the number of blocks. When employing the same number of nonlocal blocks, the proposed model consistently performs better than the original one. Moreover, the proposed model also reduces the computational cost because the affinity function can be pre-assigned in a nonlocal stage. Although the two formulations of nonlocal blocks appear similar, they are different in essence. Figure 4 shows the top 32 eigenvalues of the symmetric part of weight matrices after training when adding 4 new nonlocal blocks to PreResNet-20, where we can see that most of the eigenvalues are positive. In the following section, we will show that the formulation of the new nonlocal blocks is exactly the nonlocal analogue of local diffusion with positive coefficients.
In addition, one can observe from Table 1 and Table 2 that adding too many proposed nonlocal blocks reduces the performance.
A potential reason is that all the nonlocal blocks in a stage share the same affinity function, which is pre-computed given the input features. Therefore, the kernel function cannot precisely describe the affinity between pairs of features after many sub-blocks.
Note that in [25], the nonlocal blocks are always added individually in each fixed place of ResNets. We also compare the two nonlocal networks in the second part of Table 1, where we in total add 3 nonlocal blocks to PreResNet-20 but into different residual blocks. Although in this case the two kinds of nonlocal blocks have the same affinity kernel, the proposed nonlocal network still performs better than the original one, which implies that the proposed nonlocal blocks are more suitably defined with respect to the nature of diffusion.

4 Connection to Nonlocal Modeling

In this section, we study the relationship between the proposed nonlocal network and a couple of existing nonlocal models, i.e., nonlocal diffusion systems and Markov chains with jump processes. Then we make a further comparison between the two formulations of nonlocal blocks.

4.1 Affinity Kernels

We make some assumptions here on the affinity kernel ω in order to make connections to other nonlocal modeling approaches. Let Γ be the input feature space such that the X_i's are finite data samples drawn from Γ. Denote by K the kernel matrix such that K_ij = ω(X_i, X_j).
We \ufb01rst assume that K\nhas a \ufb01nite Frobenius norm, namely,\n(cid:107)K(cid:107)2\n\n(cid:12)(cid:12)\u03c9(Xi, Xj)(cid:12)(cid:12)2\n\n(cid:88)\n\n< \u221e .\n\n(10)\n\nF =\n\nThen without loss of generality, we can assume that the kernel matrix K has sum 1 along its rows,\ni.e., the normalization factor is assumed to be 1 for all i:\n\nCi(X) =\n\n\u03c9(Xi, Xj) = 1 ,\n\nfor i = 1,\u00b7\u00b7\u00b7 , M .\n\n(11)\n\n\u2200i,j\n\n(cid:88)\n\n\u2200j\n\nWe further assume that the kernel function is symmetric and nonnegative within \u0393, namely for all\nXi, Xj \u2208 \u0393,\n\n\u03c9(Xi, Xj) = \u03c9(Xj, Xi) and \u03c9(Xi, Xj) \u2265 0 .\n\n(12)\nNote that for the instantiations given in [25], only the Gaussian function is symmetric. However, in\nthe embedded Gaussian and dot product cases, we can instead embed the input features by some\nweight matrix as the parameter and then feed the pre-embedded features into the nonlocal stage and\nuse the traditional Gaussian or dot product function as the af\ufb01nity, namely, we replace the kernel by\n(13)\n\n\u03c9\u03b8(Xi, Xj) = \u03c9(cid:0)\u03b8(Xi), \u03b8(Xj)(cid:1) ,\n\nwhere \u03b8 is a linear embedding function.\n\n4.2 Nonlocal Diffusion Process\nDe\ufb01ne the following discrete nonlocal operator for X = [X1,\u00b7\u00b7\u00b7 , XM ] and Xi \u2208 \u0393:\n\n(LhZ)i :=\n\n\u03c9(Xi, Xj)(Zj \u2212 Zi) ,\n\n(14)\n\n(cid:88)\n\n\u2200j\n\nwhere the superscript h is the discretization parameter. Eq. (14) can be seen as a reformulation of\nnonlocal operators in some previous works, such as the graph Laplacian [4] (as well as its applications\n[18\u201320, 28]), nonlocal-type image processing algorithms [3, 10], and diffusion maps [5]. All of them\nare nonlocal analogues [7] of local diffusions. Then the equation\n\nZ n+1 = Z n + LhZ n\n\n(15)\ndescribes a discrete nonlocal diffusion, where Z satis\ufb01es that Z 0 = X for Xi \u2208 \u0393. 
The above equation is equivalent to the proposed nonlocal network in Eq. (8) with positive weights.
In general, Eq. (15) is a time-and-space discrete form, where the superscript n is the time step parameter, of the following nonlocal integro-differential equation:

z_t(x, t) − Lz(x) = 0 ,   z(x, 0) = u(x) ,   (16)

for x ∈ Ω and t ≥ 0. Here u is the initial condition, z_t := ∂z/∂t, and L defines a continuum nonlocal operator:

Lz(x) := ∫_Ω ρ(x, y)(z(y) − z(x)) dy ,   (17)

where the kernel

ρ(x, y) := ω(u(x), u(y))   (18)

is symmetric and positivity-preserving. We provide in the supplementary material some properties of the nonlocal equation (16), in order to illustrate the connection between Eq. (16) and local diffusions, and to demonstrate that the proposed nonlocal blocks can be viewed as an analogue of local diffusive terms.
Since we assume that the kernel matrix K has a finite Frobenius norm, we have

∫_Ω ∫_Ω |ρ(x, y)|² dx dy < ∞ .   (19)

Therefore, the corresponding integral operator L is a Hilbert-Schmidt operator, so that Eq. (16) and its time reversal are both well-posed and stable in finite time. The latter is equivalent to the continuum generalization of the nonlocal block with a negative coefficient (eigenvalue). Although L is a nonlocal analogue of the local Laplace (diffusion) operator, the nonlocal diffusion equation has its own merit. More specifically, the anti-diffusion in a local partial differential equation (PDE), i.e., the reversed heat equation, is ill-posed and unstable in finite time.
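The damping behavior of the forward iteration (15) is easy to observe numerically. In the sketch below, the kernel is a toy stand-in for ω: a random symmetric positive matrix pushed toward unit row sums (assumptions (11)-(12)) by Sinkhorn-style normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 8, 4

# Toy kernel satisfying (11)-(12): symmetric, positive, unit row sums,
# obtained by Sinkhorn normalization of a random symmetric matrix.
A = rng.random((M, M)) + 0.1
K = (A + A.T) / 2
for _ in range(500):
    K /= K.sum(axis=1, keepdims=True)   # normalize rows
    K /= K.sum(axis=0, keepdims=True)   # normalize columns

X = rng.standard_normal((M, d))
Z = X.copy()
for _ in range(50):                      # Eq. (15): Z^{n+1} = Z^n + L_h Z^n
    Z = Z + (K @ Z - Z)                  # (L_h Z)_i = sum_j K_ij (Z_j - Z_i)

# Forward diffusion damps differences between positions: the spread shrinks.
print(np.ptp(X, axis=0).max(), np.ptp(Z, axis=0).max())
```

Running the loop with a reversed sign (the discrete analogue of the time-reversed equation) instead amplifies the deviations, which mirrors the instability of the anti-diffusive direction discussed above.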
The stability property of the proposed nonlocal network is important, because it ensures that we can stack multiple nonlocal blocks within a single stage to fully exploit their advantages in capturing long-range features.

4.3 Markov Jump Process

The proposed nonlocal network in Eq. (8) also shares some common features with the discrete-time Markov chain with jump processes [7]. In this part, we assume, without loss of generality, that the Z_i's and X_i's are scalars. In general, the component-wise properties lead to similar properties of the vector field. Given a Markov jump process Z_t confined to remain in a bounded domain Ω ⊂ R^d, assume that z(x, t) is the corresponding probability density function. Then a general master equation describing the evolution of z [13] can be written as

z_t(x, t) = ∫_Ω [γ(x′, x, t) z(x′, t) − γ(x, x′, t) z(x, t)] dx′ ,   (20)

where γ(x′, x, t) denotes the transition rate from x′ to x at time t. Assume that the Markov process is time-homogeneous, namely γ(x′, x, t) = γ(x′, x). Then in discrete-time form [29], Eq. (20) is often reformulated as

z(x, t) − z(x, t′) = ∫_Ω [p(x′, x) z(x′, t′) − p(x, x′) z(x, t′)] dx′ ,   (21)

where p(x′, x) := (t − t′) γ(x′, x) represents the transition probability of a particle moving from x′ to x. One can easily see that as t′ → t, the solution to Eq. (21) converges to the continuum solution to Eq. (20). Note that since ∫_Ω p(x, x′) dx′ = 1, Eq.
(21) reduces to

z(x, t) = ∫_Ω p(x′, x) z(x′, t′) dx′ .   (22)

Suppose that there is a set of finite states drawn from Ω, namely x_1, x_2, ..., x_M, and finite discrete time intervals, namely t_1, t_2, ..., t_N. Let Z^n_i = z(x_i, t_n) with the initial condition Z^0_i = X_i. If we further introduce the kernel matrix K, where K_ij := (1/|M(x′_j)|) ∫_{M(x′_j)} p(x_i, x′) dx′ is the average of p(x_i, x′) over the grid mesh M(x′_j) of x′_j, then the kernel matrix K is a Markov matrix (with each column summing up to 1) and

Z^{n+1}_i = Σ_{∀j} K_ij Z^n_j   (23)

describes a discrete-time Markov jump process. The above system is equivalent to

Z^{n+1}_i − Z^n_i = Σ_{∀j} K_ij (Z^n_j − Z^n_i) ,   (24)

which is exactly the proposed nonlocal stage regardless of the weight coefficients in front of the Markov operator.
On the other hand, our proposed nonlocal network with negative weight matrices can be regarded as the reverse Markov jump process. Again, since we have the condition that

‖K‖²_F < ∞ ,   (25)

the Markov operator K is a Hilbert-Schmidt operator, which is bounded and implies that the Markov jump processes in both time directions are stable in finite time.

4.4 Further Discussion

In this section, we have derived some properties of the proposed nonlocal network. Due to the nature of a Hilbert-Schmidt operator, we expect our network to be robust. However, we should remark that whether the exact discrete formula in Eq. (8) has stable dynamics also depends on the weights W^n. This is ignored for simplicity when connecting to other nonlocal models.
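Returning to the discrete system, the equivalence between Eq. (23) and Eq. (24) hinges only on the rows (equivalently, by symmetry of the kernel, the columns) of K summing to one, and can be checked numerically with a random Markov matrix standing in for K:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 6
K = rng.random((M, M))
K /= K.sum(axis=1, keepdims=True)   # Markov matrix: each row sums to 1

Z = rng.standard_normal(M)

step_markov = K @ Z                 # Eq. (23): Z^{n+1}_i = sum_j K_ij Z^n_j
# Eq. (24): Z^n_i + sum_j K_ij (Z^n_j - Z^n_i); the row sums are all 1.
step_stage = Z + (K @ Z - K.sum(axis=1) * Z)

print(np.allclose(step_markov, step_stage))  # True when each row of K sums to 1
```

This is the same one-line identity used in the stability discussion: subtracting Z_i Σ_j K_ij from (KZ)_i leaves exactly the jump-process update.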
In practice, the stability holds as long as the weight parameters are small enough that the CFL condition is satisfied.
In comparison, for the original nonlocal network, we define a discrete nonlinear nonlocal operator as

(L̃_h Z)_i := − Σ_{∀j} ω(Z_i, Z_j) Z_j .   (26)

Then the flow of the original nonlocal network with small negative coefficients can be described as

Z^{n+1} − Z^n = L̃_h Z^n ,   (27)

with the initial condition Z^0 = X. By letting Z^{n+1} = Z^n, the steady-state equation of Eq. (27) can be written as

L̃_h Z = 0 .   (28)

Since the kernel ω is strictly positive, the only steady-state solution is Z ≡ 0, which means that the output signals of the original nonlocal blocks tend to be damped out (diffused) along the iterations. However, the original nonlocal operation with positive eigenvalues is unstable in finite time. While one can still learn long-range features in the network, stacking several original blocks casts uncertainty onto the model and thus requires extra work to study the initialization strategy or to fine-tune the parameters.
In applications, the nonlocal stage is usually plugged into ResNets. From the viewpoint of PDEs, the flow of a ResNet can be seen as the forward Euler discretization of a dynamical system with respect to z(t) [26]:

dz/dt = F(z, W(t)) ,   z(0) = X .   (29)

The right-hand side F for a PreResNet is given by

F(z, W(t)) = W_2(t) f(W_1(t) f(z)) ,   (30)

which can be regarded as a reaction term in a PDE because it does not involve any differential operator explicitly. From standard PDE theory, incorporating the proposed nonlocal stage into ResNets is equivalent to introducing diffusion terms into the reaction system.
As a result, the diffusion terms can regularize the PDE and thus make it more stable [9].

5 Conclusion

In this paper, we studied the damping effect of the existing nonlocal networks in the literature by performing a spectrum analysis on the weight matrices of well-trained networks. We then proposed a new class of nonlocal networks, which not only capture long-range dependencies but also exhibit more stable dynamics and are more robust to the number of nonlocal blocks. Therefore, we can stack more nonlocal blocks in order to fully exploit their advantages. In the future, we aim to investigate the proposed nonlocal network on other challenging real-world learning tasks.

Acknowledgments

This work is supported in part by US NSF CCF-1740833, DMR-1534910 and DMS-1719699. Y. Tao wants to thank Tingkai Liu and Xinpeng Chen for their help with the experiments.

References

[1] Beran, Jan, Sherman, Robert, Taqqu, Murad S, and Willinger, Walter. Long-range dependence in variable-bit-rate video traffic. IEEE Transactions on Communications, 43(234):1566-1579, 1995.

[2] Buades, Antoni, Coll, Bartomeu, and Morel, J-M. A non-local algorithm for image denoising. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 2, pp. 60-65. IEEE, 2005.

[3] Buades, Antoni, Coll, Bartomeu, and Morel, Jean-Michel. A review of image denoising algorithms, with a new one. Multiscale Modeling & Simulation, 4(2):490-530, 2005.

[4] Chung, Fan RK. Spectral Graph Theory. Number 92. American Mathematical Society, 1997.

[5] Coifman, Ronald R and Lafon, Stéphane. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5-30, 2006.

[6] Cont, Rama. Long range dependence in financial markets. In Fractals in Engineering, pp. 159-179. Springer, 2005.

[7] Du, Qiang, Gunzburger, Max, Lehoucq, Richard B, and Zhou, Kun. Analysis and approximation of nonlocal diffusion problems with volume constraints. SIAM Review, 54(4):667-696, 2012.

[8] Elman, Jeffrey L. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3):195-225, 1991.

[9] Evans, Lawrence C. Partial differential equations and Monge-Kantorovich mass transfer. Current Developments in Mathematics, 1997(1):65-126, 1997.

[10] Gilboa, Guy and Osher, Stanley. Nonlocal linear image regularization and supervised segmentation. Multiscale Modeling & Simulation, 6(2):595-630, 2007.

[11] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630-645. Springer, 2016.

[12] Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456, 2015.

[13] Kenkre, VM, Montroll, EW, and Shlesinger, MF. Generalized master equations for continuous-time random walks. Journal of Statistical Physics, 9(1):45-50, 1973.

[14] Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.

[15] LeCun, Yann, Bengio, Yoshua, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.

[16] LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436, 2015.

[17] Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. In Artificial Intelligence and Statistics, pp. 562-570, 2015.

[18] Liu, Wei and Chang, Shih-Fu. Robust multi-class transductive learning with graphs. CVPR, 2009.

[19] Liu, Wei, He, Junfeng, and Chang, Shih-Fu. Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 679-686, 2010.

[20] Liu, Wei, Wang, Jun, and Chang, Shih-Fu. Robust and scalable graph-based semisupervised learning. Proceedings of the IEEE, 100(9):2624-2638, 2012.

[21] Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807-814, 2010.

[22] Pipiras, Vladas and Taqqu, Murad S. Long-Range Dependence and Self-Similarity, volume 45. Cambridge University Press, 2017.

[23] Silling, Stewart A. Reformulation of elasticity theory for discontinuities and long-range forces. Journal of the Mechanics and Physics of Solids, 48(1):175-209, 2000.

[24] Tadmor, Eitan. Mathematical aspects of self-organized dynamics: consensus, emergence of leaders, and social hydrodynamics. SIAM News, 48(9), 2015.

[25] Wang, Xiaolong, Girshick, Ross, Gupta, Abhinav, and He, Kaiming. Non-local neural networks. CVPR, 2018.

[26] Weinan, E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1-11, 2017.

[27] Willinger, Walter, Paxson, Vern, Riedi, Rolf H, and Taqqu, Murad S. Long-range dependence and data network traffic. Theory and Applications of Long-Range Dependence, pp. 373-407, 2003.

[28] Wu, Baoyuan, Jia, Fan, Liu, Wei, Ghanem, Bernard, and Lyu, Siwei. Multi-label learning with missing labels using mixed dependency graphs. International Journal of Computer Vision, pp. 1-22, 2018.

[29] Zhao, Yanxiang, Wang, Jiakou, Ma, Yanping, and Du, Qiang. Generalized local and nonlocal master equations for some stochastic processes. Computers & Mathematics with Applications, 71(11):2497-2512, 2016.