{"title": "Input Similarity from the Neural Network Perspective", "book": "Advances in Neural Information Processing Systems", "page_first": 5342, "page_last": 5351, "abstract": "Given a trained neural network, we aim at understanding how similar it considers any two samples. For this, we express a proper definition of similarity from the neural network perspective (i.e. we quantify how undissociable two inputs A and B are), by taking a machine learning viewpoint: how much a parameter variation designed to change the output for A would impact the output for B as well?\n\nWe study the mathematical properties of this similarity measure, and show how to estimate sample density with it, in low complexity, enabling new types of statistical analysis for neural networks. We also propose to use it during training, to enforce that examples known to be similar should also be seen as similar by the network.\n\nWe then study the self-denoising phenomenon encountered in regression tasks when training neural networks on datasets with noisy labels. We exhibit a multimodal image registration task where almost perfect accuracy is reached, far beyond label noise variance. Such an impressive self-denoising phenomenon can be explained as a noise averaging effect over the labels of similar examples. We analyze data by retrieving samples perceived as similar by the network, and are able to quantify the denoising effect without requiring true labels.", "full_text": "Input Similarity from the Neural Network Perspective

Guillaume Charpiat1

Nicolas Girard2

Loris Felardos1

Yuliya Tarabalka2,3

1 TAU team, INRIA Saclay, LRI, Univ. Paris-Sud

2 TITANE team, INRIA Sophia-Antipolis, Univ. Côte d’Azur

3 LuxCarta Technology

firstname.lastname@inria.fr

Abstract

Given a trained neural network, we aim at understanding how similar it considers any two samples. For this, we express a proper definition of similarity from the neural network perspective (i.e. 
we quantify how undissociable two inputs A and B are), by taking a machine learning viewpoint: how much a parameter variation designed to change the output for A would impact the output for B as well?

We study the mathematical properties of this similarity measure, and show how to estimate sample density with it, in low complexity, enabling new types of statistical analysis for neural networks. We also propose to use it during training, to enforce that examples known to be similar should also be seen as similar by the network.

We then study the self-denoising phenomenon encountered in regression tasks when training neural networks on datasets with noisy labels. We exhibit a multimodal image registration task where almost perfect accuracy is reached, far beyond label noise variance. Such an impressive self-denoising phenomenon can be explained as a noise averaging effect over the labels of similar examples. We analyze data by retrieving samples perceived as similar by the network, and are able to quantify the denoising effect without requiring true labels.

1 Introduction

The notion of similarity between data points is an important topic in the machine learning literature, most obviously in domains such as image retrieval, where images similar to a query have to be found, but not exclusively. For instance, when training auto-encoders, the quality of the reconstruction is usually quantified as the L2 norm between the input and output images. Such a similarity measure is however questionable, as color comparison, performed pixel per pixel, is a poor estimate of human perception: the L2 norm can vary a lot under transformations barely noticeable to the human eye, such as small translations or rotations (for instance on textures), and does not carry semantic information, i.e. whether the same kind of objects are present in the image. 
Therefore, so-called perceptual losses [10] were introduced to quantify image similarity: each image is fed to a standard pre-trained network such as VGG, and the activations in a particular intermediate layer are used as descriptors of the image [4, 5]. The distance between two images is then set as the L2 norm between these activations. Such a distance implicitly carries semantic information, as the VGG network was trained for image classification. However, the choice of the layer to consider is arbitrary. In the ideal case, one would wish to combine information from all layers, as some are more abstract and some more detail-specific. Then, how to choose the weights to combine the different layers? Would it be possible to build a canonical similarity measure, well posed theoretically?

More importantly, the previous literature does not consider the notion of input similarity from the point of view of the neural network that is being used, but from the point of view of another one (typically, VGG) which aims at imitating human perception. Yet, neural networks are black boxes difficult to interpret, and showing which samples a network considers as similar would help to explain its decisions. Also, the number of such similar examples would be a key element for confidence estimation at test time. Moreover, to explain the self-denoising phenomenon, i.e. why predictions can be far more accurate than the label noise magnitude in the training set, thanks to a noise averaging effect over similar examples [11], one needs to quantify similarity according to the network.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The purpose of this article is to express the notion of similarity from the network’s point of view. We first define it, and study it mathematically, in Section 2, in the one-dimensional output case for the sake of simplicity. 
Higher-dimensional outputs are dealt with in Section 3. We then compute, in Section 4, the number of neighbors (i.e., of similar samples), and propose for this a very fast estimator. This brings new tools to analyze already-trained networks. As they are differentiable and fast to compute, they can be used during training as well, e.g., to enforce that given examples should be perceived as similar by the network (c.f. supp. mat.). Finally, in Section 5, we apply the proposed tools to analyze a network trained with noisy labels for a remote sensing image alignment task, and formalize the self-denoising phenomenon, quantifying its effect, extending [11] to real datasets.

2 Similarity

In this section we define a proper, intrinsic notion of similarity as seen by the network, relying on how easily it can distinguish different inputs.

2.1 Similarity from the point of view of the parameterized family of functions

Let f_θ be a parameterized function, typically a neural network already trained for some task, and x, x′ possible inputs, for instance from the training or test set. For the sake of simplicity, let us suppose in a first step that f_θ is real valued. To express the similarity between x and x′, as seen by the network, one could compare the output values f_θ(x) and f_θ(x′). This is however not very informative, and a same output might be obtained for different reasons.

Instead, we define similarity as the influence of x over x′, by quantifying how much an additional training step for x would change the output for x′ as well. If x and x′ are very different from the point of view of the neural network, changing f_θ(x) will have little consequence on f_θ(x′). Vice versa, if they are very similar, changing f_θ(x) will greatly affect f_θ(x′) as well.

Figure 1: Moves in the space of outputs. We quantify the influence of a data point x over another one x′ by how much the tuning of parameters θ to obtain a desired output change v for f_θ(x) will affect f_θ(x′) as well.

Formally, if one wants to change the value of f_θ(x) by a small quantity ε, one needs to update θ by δθ = ε ∇_θ f_θ(x) / ‖∇_θ f_θ(x)‖². Indeed, after the parameter update, the new value at x will be:

f_{θ+δθ}(x) = f_θ(x) + ∇_θ f_θ(x) · δθ + O(‖δθ‖²) = f_θ(x) + ε + O(ε²).

This parameter change induces a value change at any other point x′:

f_{θ+δθ}(x′) = f_θ(x′) + ∇_θ f_θ(x′) · δθ + O(‖δθ‖²) = f_θ(x′) + ε (∇_θ f_θ(x′) · ∇_θ f_θ(x)) / ‖∇_θ f_θ(x)‖² + O(ε²).

Therefore the kernel

k^N_θ(x, x′) = (∇_θ f_θ(x) · ∇_θ f_θ(x′)) / ‖∇_θ f_θ(x)‖²

represents the influence of x over x′: if one wishes to change the output value f_θ(x) by ε, then f_θ(x′) will change by ε k^N_θ(x, x′). In particular, if k^N_θ(x, x′) is high, then x and x′ are not distinguishable from the point of view of the network, as any attempt to move f_θ(x) will move f_θ(x′) as well (see Fig. 1). We thus see k^N_θ(x, x′) as a measure of similarity. 
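As a concrete numerical illustration (our own minimal sketch, not the authors' code), these kernels only require per-sample parameter gradients; below we write the gradient of a tiny one-hidden-layer tanh network analytically, then evaluate the influence kernel k^N and its symmetric variants defined next (all function names are ours):

```python
import numpy as np

def grad_f(theta, x):
    """Gradient of f_theta(x) w.r.t. all parameters, for a tiny
    one-hidden-layer network f(x) = w2 . tanh(w1 * x + b1)."""
    w1, b1, w2 = theta["w1"], theta["b1"], theta["w2"]
    h = np.tanh(w1 * x + b1)       # hidden activations
    dh = 1.0 - h ** 2              # tanh'
    return np.concatenate([
        w2 * dh * x,               # d f / d w1
        w2 * dh,                   # d f / d b1
        h,                         # d f / d w2
    ])

def k_influence(theta, x, xp):
    """k^N_theta(x, x'): influence of x over x' (not symmetric)."""
    g, gp = grad_f(theta, x), grad_f(theta, xp)
    return g.dot(gp) / g.dot(g)

def k_inner(theta, x, xp):
    """k^I_theta(x, x'): inner product of parameter gradients."""
    return grad_f(theta, x).dot(grad_f(theta, xp))

def k_corr(theta, x, xp):
    """k^C_theta(x, x'): correlation, bounded in [-1, 1]."""
    g, gp = grad_f(theta, x), grad_f(theta, xp)
    return g.dot(gp) / (np.linalg.norm(g) * np.linalg.norm(gp))

rng = np.random.default_rng(0)
theta = {"w1": rng.normal(size=8), "b1": rng.normal(size=8),
         "w2": rng.normal(size=8)}
# Self-similarity is maximal, k^C is symmetric, k^N generally is not:
print(k_corr(theta, 0.3, 0.3))                               # -> 1.0 (up to rounding)
print(k_influence(theta, 0.3, 0.7), k_influence(theta, 0.7, 0.3))
```

For a real network one would obtain `grad_f` from automatic differentiation; the analytic toy gradient above only keeps the example dependency-free.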
Note however that k^N_θ(x, x′) is not symmetric.

Symmetric similarity: correlation. Two symmetric kernels arise naturally: the inner product

k^I_θ(x, x′) = ∇_θ f_θ(x) · ∇_θ f_θ(x′)   (1)

and its normalized version, the correlation

k^C_θ(x, x′) = (∇_θ f_θ(x) / ‖∇_θ f_θ(x)‖) · (∇_θ f_θ(x′) / ‖∇_θ f_θ(x′)‖)   (2)

which has the advantage of being bounded (in [−1, 1]), thus expressing similarity in a usual meaning.

2.2 Properties for vanilla neural networks

Intuitively, inputs that are similar from the network perspective should produce similar outputs; we can check that k^C_θ is a good similarity measure in this respect (all proofs are deferred to the supplementary materials):

Theorem 1. For any real-valued neural network f_θ whose last layer is a linear layer (without any parameter sharing) or a standard activation function thereof (sigmoid, tanh, ReLU...), and for any inputs x and x′,
∇_θ f_θ(x) = ∇_θ f_θ(x′) ⟹ f_θ(x) = f_θ(x′).

Corollary 1. Under the same assumptions, for any inputs x and x′,
k^C_θ(x, x′) = 1 ⟹ ∇_θ f_θ(x) = ∇_θ f_θ(x′), hence k^C_θ(x, x′) = 1 ⟹ f_θ(x) = f_θ(x′).

Furthermore,

Theorem 2. For any real-valued neural network f_θ without parameter sharing, if ∇_θ f_θ(x) = ∇_θ f_θ(x′) for two inputs x, x′, then all useful activities computed when processing x are equal to the ones obtained when processing x′.

We name useful activities all activities a_i(x) whose variation would have an impact on the output, i.e. all the ones satisfying df_θ(x)/da_i ≠ 0. This condition is typically not satisfied when the activity is negative and followed by a ReLU, or when it is multiplied by a 0 weight, or when all its contributions to the output cancel one another (e.g., a sum of two neurons with opposite weights: f_θ(x) = σ(a_i(x)) − σ(a_i(x))).

Link with the perceptual loss. For a vanilla network without parameter sharing, the gradient ∇_θ f_θ(x) is a list of coefficients ∇_{w_i^j} f_θ(x) = (df_θ(x)/db_j) a_i(x), where w_i^j is the parameter-factor that multiplies the input activation a_i(x) in neuron j, and of coefficients ∇_{b_j} f_θ(x) = df_θ(x)/db_j for neuron biases, which we will consider as standard parameters b_j = w_0^j that act on a constant activation a_0(x) = 1, yielding ∇_{w_0^j} f_θ(x) = (df_θ(x)/db_j) a_0(x). Thus the gradient ∇_θ f_θ(x) can be seen as a list of all activation values a_i(x) multiplied by the potential impact on the output f_θ(x) of the neurons j using them, i.e. df_θ(x)/db_j. Each activation appears in this list as many times as it is fed to different neurons. The similarity between two inputs then rewrites:

k^I_θ(x, x′) = Σ_{activities i} λ_i(x, x′) a_i(x) a_i(x′)   where   λ_i(x, x′) = Σ_{neurons j using a_i} (df_θ(x)/db_j) (df_θ(x′)/db_j)

are data-dependent importance weights. Such weighting schemes on activation units naturally arise when expressing intrinsic quantities; the use of natural gradients would bring invariance to re-parameterization [16, 17]. On the other hand, the inner product related to the perceptual loss would be

Σ_{activities i ≠ 0} λ_{layer(i)} a_i(x) a_i(x′)

for some arbitrary fixed layer-dependent weights λ_{layer(i)}.

2.3 Properties for parameter-sharing networks

When sharing weights, as in convolutional networks, the gradient ∇_θ f_θ(x) is made of the same coefficients (impact-weighted activations) but summed over shared parameters. Denoting by S(i) the set of (neuron, input activity) pairs where the parameter w_i is involved,

k^I_θ(x, x′) = Σ_{params i} ( Σ_{(j,k)∈S(i)} (df_θ(x)/db_j) a_k(x) ) ( Σ_{(j,k)∈S(i)} (df_θ(x′)/db_j) a_k(x′) ).

Thus, in convolutional networks, k^I_θ similarity does not imply similarity of first layer activations anymore, but only of their (impact-weighted) spatial average. More generally, any invariance introduced by a weight sharing scheme in an architecture will be reflected in the similarity measure k^I_θ, which is expected as k^I_θ was defined as the input similarity from the neural network perspective. Note that this type of objects was recently studied from an optimization viewpoint under the name of Neural Tangent Kernel [9, 1] in the infinite layer width limit.

3 Higher output dimension

Let us now study the more complex case where f_θ(x) is a vector (f_θ^i(x))_{i∈[1,d]} in R^d with d > 1. Under a mild hypothesis on the network (output expressivity), always satisfied unless specially designed not to:

Theorem 3. The optimal parameter change δθ to push f_θ(x) in a direction v ∈ R^d (with a force ε ∈ R), i.e. such that f_{θ+δθ}(x) − f_θ(x) = εv, induces at any other point x′ the following output variation:

f_{θ+δθ}(x′) − f_θ(x′) = ε K_θ(x′, x) K_θ(x, x)⁻¹ v + O(ε²)   (3)

where the d × d kernel matrix K_θ(x′, x) is defined by K_θ^{ij}(x′, x) = ∇_θ f_θ^i(x′) · ∇_θ f_θ^j(x).

The similarity kernel is now a matrix and not just a single value, as it describes the relation between moves v ∈ R^d. Note that these matrices K_θ are only d × d where d is the output dimension. They are thus generally small and easy to manipulate or invert.

Normalized similarity matrix. The unitless, symmetrized, normalized version of the kernel (3) is:

K^C_θ(x, x′) = K_θ(x, x)^{−1/2} K_θ(x, x′) K_θ(x′, x′)^{−1/2}.   (4)

It has the following properties: its coefficients are bounded, in [−1, 1]; its trace is at most d; its (Frobenius) norm is at most √d; self-similarity is identity: ∀x, K^C_θ(x, x) = Id; the kernel is symmetric, in the sense that K^C_θ(x′, x) = K^C_θ(x, x′)^T.

Similarity in a single value. To summarize the similarity matrix K^C_θ(x, x′) into a single real value in [−1, 1], we consider:

k^C_θ(x, x′) = (1/d) Tr K^C_θ(x, x′).   (5)

It can be shown indeed that if k^C_θ(x, x′) is close to 1, then K^C_θ(x, x′) is close to Id, and reciprocally. See the supplementary materials for more details and a discussion about the links between (1/d) Tr K^C_θ(x, x′) and ‖K^C_θ(x, x′) − Id‖_F.

Metrics on output: rotation invariance. Similarity in R^d might be richer than just estimating distances in L2 norm. 
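Equations (3)-(5) can be made concrete with a few lines of linear algebra; in this sketch of ours, random (d, p) Jacobians stand in for the per-output gradients ∇_θ f_θ^i, and all function names are hypothetical:

```python
import numpy as np

def kernel_matrix(J, Jp):
    """K_theta(x, x'): d x d matrix with K^ij = grad f^i(x) . grad f^j(x').
    J and Jp are (d, p) Jacobians of the d outputs w.r.t. the p parameters."""
    return J @ Jp.T

def inv_sqrt(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V / np.sqrt(w)) @ V.T

def normalized_kernel(J, Jp):
    """K^C(x, x') = K(x,x)^(-1/2) K(x,x') K(x',x')^(-1/2), as in Eq. (4)."""
    return inv_sqrt(kernel_matrix(J, J)) @ kernel_matrix(J, Jp) @ inv_sqrt(kernel_matrix(Jp, Jp))

def similarity_value(J, Jp):
    """k^C(x, x') = Tr(K^C) / d, a single value in [-1, 1], as in Eq. (5)."""
    KC = normalized_kernel(J, Jp)
    return np.trace(KC) / KC.shape[0]

d, p = 2, 40                        # output dimension, parameter count
rng = np.random.default_rng(1)
J, Jp = rng.normal(size=(d, p)), rng.normal(size=(d, p))
# Self-similarity is the identity matrix, hence k^C(x, x) = 1:
print(np.allclose(normalized_kernel(J, J), np.eye(d)))   # True
print(similarity_value(J, Jp))      # a single value in [-1, 1]
```

Since the matrices involved are only d × d, the eigendecomposition in `inv_sqrt` is cheap regardless of the number of parameters p.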
For instance, for our 2D image registration task, the network could be known (or desired) to be equivariant to rotations. The similarity between two output variations v and v′ can be made rotation-invariant by applying beforehand the rotation that best aligns v and v′. This can actually be computed easily in closed form and yields:

k^{C,rot}_θ(x, x′) = (1/2) √( ‖K^C_θ(x, x′)‖²_F + 2 det K^C_θ(x, x′) ).

Note that other metrics are possible in the output space. For instance, the loss metric quantifies the norm of a move v by its impact on the loss, dL(y)/dy |_{f_θ(x)} (v). It has a particular meaning though, and is not always relevant, e.g. in the noisy label case seen in Section 5.

The case of classification tasks. When the output of the network is a probability distribution p_{θ,x}(c), over a finite number of given classes c for example, it is natural from an information theoretic point of view to rather consider f^c_θ(x) = − log p_{θ,x}(c). These are actually the quantities computed in the pre-softmax layer, from which common practice directly computes the cross-entropy loss. It turns out that the L2 norm of variations δf in this space naturally corresponds to the Fisher information metric, which quantifies the impact of parameter variations δθ on the output probability p_{θ,x}, as KL(p_{θ,x} || p_{θ+δθ,x}). The matrices K_θ(x, x) = (∇_θ f^c_θ(x) · ∇_θ f^{c′}_θ(x))_{c,c′} and F_{θ,x} = E_c[∇_θ f^c_θ(x) ∇_θ f^c_θ(x)^T] are indeed to each other what correlation is to covariance. Thus the quantities defined in Equation (5) already take into account information geometry when applied to the pre-softmax layer, and do not need supplementary metric adjustment.

Faster setup for classification tasks with many classes. In a classification task in d classes with large d, the computation of d × d matrices may be prohibitive. As a workaround, for a given input training sample x, the classification task can be seen as a binary one (the right label c_R vs. the other ones), in which case the d outputs of the neural network can be accordingly combined into a single real value. The 1D similarity measure can then be used to compare any training samples of the same class.

When making statistics on similarity values E_{x′}[k^C_θ(x, x′)], another possible task binarization approach is to sample an adversary class c_A along with x′, and hence consider ∇_θ f^{c_R}_θ(x) − ∇_θ f^{c_A}_θ(x). Both approaches lead to similar results in the Enforcing Similarity section in the supplementary materials.

4 Estimating density

In this section, we use similarity to estimate input neighborhoods and perform statistics on them.

4.1 Estimating the number of neighbors

Given a point x, how many samples x′ are similar to x according to the network? This can be measured by computing k^C_θ(x, x′) for all x′ and picking the closest ones, e.g. the x′ such that k^C_θ(x, x′) ⩾ 0.9. More generally, for any data point x, the histogram of the similarity k^C_θ(x, x′) over all x′ in the dataset (or a representative subset thereof) can be drawn, and turned into an estimate of the number of neighbors of x. 
To do this, several types of estimates are possible:

• hard-thresholding, for a given threshold τ ∈ [0, 1]: N_τ(x) = Σ_{x′} 1_{k^C_θ(x,x′) ⩾ τ}
• soft estimate: N_S(x) = Σ_{x′} k^C_θ(x, x′)
• less-soft positive-only estimate (α > 0): N⁺_α(x) = Σ_{x′ : k^C_θ(x,x′) > 0} k^C_θ(x, x′)^α

In practice we observe that k^C_θ is very rarely negative, and thus the soft estimate N_S can be justified as an average of the hard-thresholding estimate N_τ over all possible thresholds τ:

∫₀¹ N_τ(x) dτ = Σ_{x′} ∫₀¹ 1_{k^C_θ(x,x′) ⩾ τ} dτ = Σ_{x′} k^C_θ(x, x′) 1_{k^C_θ(x,x′) ⩾ 0} = N⁺₁(x) ≈ N_S(x)

4.2 Low complexity of the soft estimate N_S(x)

The soft estimate N_S(x) is rewritable as:

N_S(x) = Σ_{x′} k^C_θ(x, x′) = Σ_{x′} (∇_θ f_θ(x) / ‖∇_θ f_θ(x)‖) · (∇_θ f_θ(x′) / ‖∇_θ f_θ(x′)‖) = (∇_θ f_θ(x) / ‖∇_θ f_θ(x)‖) · g   with   g = Σ_{x′} ∇_θ f_θ(x′) / ‖∇_θ f_θ(x′)‖

and consequently N_S(x) can be computed jointly for all x in linear time O(|D| p) in the dataset size |D| and in the number of parameters p, in just two passes over the dataset, when the output dimension is 1. For higher output dimensions d, a similar trick can be used and the complexity becomes O(|D| d² p). For classification tasks with a large number d of classes, the complexity can be reduced to O(|D| p) through an approximation consisting in binarizing the task (c.f. end of Section 3).

4.3 Test of the various estimators

In order to rapidly test the behavior of all possible estimators, we applied them to a toy problem where the network’s goal is to predict a sinusoid. To change the difficulty of the problem, we vary its frequency, while keeping the number of samples constant. More details and results of the toy problem are in the supplementary materials. Fig. 2 shows, for each estimator (with different parameters when relevant), the result of its neighbor count estimation. When the frequency f of the sinusoid to predict increases, the number of neighbors decreases in 1/f for every estimator. This aligns with our intuition that as the problem gets harder, the network needs to distinguish input samples more to achieve a good performance, so the number of neighbors is lower. In particular we observe that the proposed N_S(x) estimator behaves well, thus we will use that one in bigger studies requiring an efficient estimator.

Figure 2: Density estimation using the various approaches (log scale). All approaches behave similarly and show good results, except the ones with extreme thresholds.

4.4 Further potential uses for fitness estimation

When the number of neighbors of a training point x is very low, the network is able to set any label to x, as this won’t interfere with other points, by definition of our similarity criterion k_θ(x, x′). This is thus a typical overfit case, where the network can learn by heart a label associated to a particular, isolated point.

On the opposite, when the set of neighbors of x is a large fraction of the dataset, comprising varied elements, by definition of k_θ(x, x′) the network is not able to distinguish them, and consequently it can only provide a common output for all of them. 
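The two-pass, linear-time computation of N_S from Section 4.2 can be sketched as follows (our own NumPy illustration under the 1-D output assumption; the `grads` array of per-sample parameter gradients is random stand-in data):

```python
import numpy as np

def soft_neighbor_counts(grads):
    """N_S(x) for every sample x in O(|D| p): one pass to accumulate
    g = sum of unit gradients, one pass to take dot products with it.
    grads: (n, p) array whose row i is grad_theta f_theta(x_i)."""
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    g = unit.sum(axis=0)              # first pass
    return unit @ g                   # second pass: N_S(x_i) = unit_i . g

def soft_neighbor_counts_naive(grads):
    """Reference O(|D|^2 p) version: N_S(x) = sum over x' of k^C(x, x')."""
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    K = unit @ unit.T                 # full matrix of correlations k^C
    return K.sum(axis=1)

rng = np.random.default_rng(2)
grads = rng.normal(size=(500, 64))    # stand-in per-sample gradients
fast = soft_neighbor_counts(grads)
naive = soft_neighbor_counts_naive(grads)
print(np.allclose(fast, naive))       # True: same counts at linear cost
```

The gain over the naive version is that the pairwise similarity matrix is never materialized, which is what makes dataset-wide statistics affordable.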
Therefore it might not be able to express variety\nenough, which would be a typical under\ufb01t case.\nThe quality of \ufb01t can thus be observed by monitoring the number of neighbors together with the\nvariance of the desired labels in the neighborhoods (to distinguish under\ufb01t from just high density).\n\nPrediction uncertainty A measure of the uncertainty of a prediction f\u03b8(x) could be to check how\neasy it would have been to obtain another value during training, without disturbing the training of\n(cid:107)\u2207\u03b8f\u03b8(x)(cid:107)2 v over other points x(cid:48) of the\nother points. A given change v of f\u03b8(x) induces changes\n(cid:107)\u2207\u03b8f\u03b8(x)(cid:107)2 v(cid:107). The uncertainty factor would then be\n\ndataset, creating a total L1 disturbance(cid:80)\n\nx(cid:48) (cid:107) kI\n\nthe norm of v affordable within a disturbance level, and quickly approximable as (cid:107)\u2207\u03b8f\u03b8(x)(cid:107)2\n\u03b8 (x,x(cid:48)).\n\nx(cid:48) kI\n\n\u03b8 (x,x(cid:48))\nkI\n\n\u03b8 (x,x(cid:48))\n\n(cid:80)\n\n5 Dataset self-denoising\n5.1 Motivation: example of remote sensing image registration with noisy labels\n\nIn remote sensing imagery, data is abundant but noisy [14]. For instance RGB satellite images\nand binary cadaster maps (delineating buildings) are numerous but badly aligned for various rea-\nsons (annotation mistakes, atmosphere disturbance, elevation variations...). In a recent preliminary\nwork [6], we tackled the task of automatically registering these two types of images together with\nneural networks, training on a dataset [13] with noisy annotations from OSM[18], and hoping the\nnetwork would be able to learn from such a dataset of imperfect alignments. Learning with noisy\nlabels is indeed an active topic of research [21, 15, 12].\nFor this, we designed an iterative approach: train, then use the outputs of the network on the training\nset to re-align it; repeat (for 3 iterations). 
The results were surprisingly good, yielding far better\nalignments than the ground truth it learned from, both qualitatively (Figure 3) and quantitatively\n(Figure 4, obtained on manually-aligned data): the median registration error dropped from 18 pixels to\n3.5 pixels, which is the best score one could hope for, given intrinsic ambiguities in such registration\ntask. To check that this performance was not due to a subset of the training data that would be\nperfectly aligned, we added noise to the ground truth and re-trained from it: the new results were\nabout as good again (dashed lines). Thus the network did learn almost perfectly just from noisy labels.\n\n6\n\n0.00.51.01.52.02.5Frequency (log)34567Neighbor count (log)Avg of all measures across all samplesneighbors_softneighbors_less_soft_n_2neighbors_less_soft_n_3neighbors_less_soft_n_4neighbors_hard_t_0.5neighbors_hard_t_0.6neighbors_hard_t_0.7neighbors_hard_t_0.8neighbors_hard_t_0.9neighbors_hard_t_0.925neighbors_hard_t_0.95neighbors_hard_t_0.975neighbors_hard_t_0.99\fFigure 3: Qualitative alignment results [6]\non a crop of bloomington22 from the Inria\ndataset [13]. Red: initial dataset annota-\ntions; blue: aligned annotations round 1;\ngreen: aligned annotations round 2.\n\nFigure 4: Accuracy cumulative distributions [6] mea-\nsured with the manually-aligned annotations of bloom-\nington22 [13]. Read as: fraction of image pixels whose\nregistration error is less than threshold \u03c4.\n\nAn explanation for this self-denoising phenomenon is proposed in [11] as follows. Let us consider a\nregression task, with a L2 loss, and where true labels y were altered with i.i.d. 
noise \u03b5 of variance v.\n\nSuppose a same input x appears n times in the training set, thus with n different labels(cid:101)yi = y + \u03b5i.\n(cid:80)\ni(cid:101)yi, whose distance to the\n\nThe network can only output the same prediction for all these n cases (since the input is the same),\nand the best option, considering the L2 loss, is to predict the average 1\nn\ntrue label y is O( v\u221a\nn can be observed. However, the exact\nsame point x is not likely to appear several times in a dataset (with different labels). Rather, relatively\nsimilar points may appear, and the amplitude of the self-denoising effect will be a function of their\nnumber. Here, the similarity should re\ufb02ect the neural network perception (similar inputs yield the\nsame output) and not an a priori norm chosen on the input space.\n\nn ). Thus a denoising effect by a factor\n\n\u221a\n\n5.2 Similarity experimentally observed between patches\n\nWe studied the multi-round training scheme of [6] by applying our similarity measure to a sampling\nof input patches of the training dataset for one network per round. The principle of the multiple\nround training scheme is to reduce the noise of the annotations, obtaining aligned annotations in\nthe end (more details in the supplementary materials). For a certain input patch, we computed its\nsimilarity with all the other patches for the 3 networks. With those similarities we can compute the\nnearest neighbors of that patch, see Fig. 5. The input patch is of a suburb area with sparse houses\nand individual trees. The closest neighbors look similar as they usually feature the same types of\nbuildings, building arrangement and vegetation. However sometimes the network sees a patch as\nsimilar when it is not clear from our point of view (for example patches with large buildings).\nFor more in-depth results, we computed the histogram of similarities for the same patch, see Fig. 
6.\nWe observe that round 2 shows different neighborhood statistics, in that the patch is closer to all\nother patches than in other rounds. We observe the same behavior in 19 other input patches (see\nsuppl. materials). An hypothesis for this phenomenon is that the average gradient was not 0 at the end\nof that training round (due to optimization convergence issues, e.g.), which would shift all similarity\nhistograms by a same value.\nQualitatively, for patches randomly sampled, their similarity histograms tend to be approximately\nsymmetric in round 2, but with a longer left tail in round 1 and a longer right tail in round 3.\nNeighborhoods thus seem to change across the rounds, with fewer and fewer close points (if removing\nthe global histogram shift in round 2). A possible interpretation is that this would re\ufb02ect an increasing\nability of the network to distinguish between different patches, with \ufb01ner features in later training\nrounds.\n\n5.3 Comparison to the perceptual loss\n\nWe compare our approach to the perceptual loss on a nearest neighbor retrieval task. We notice that\nthe perceptual loss sometimes performs reasonably well, but often not. For instance, we show in\nFig. 7 the closest neighbors to a structured residential area image, for the perceptual loss (\ufb01rst row:\nnot making sense) and for our similarity measure (second row: similar areas).\n\n7\n\n\fFigure 5: Example of nearest neighbors for a patch. Each line corresponds to a round. 
Each patch\nhas its similarity written under it.\n\n(a) Round 1\n\n(b) Round 2\n\n(c) Round 3\n\nFigure 6: Histograms of similarities for one patch across rounds.\n\n5.4 From similarity statistics to self-denoising effect estimation\n\nWe now show how such similarity experimental computations can be used to solve the initial problem\nof Section 5, by explicitly turning similarity statistics into a quanti\ufb01cation of the self-denoising effect.\n\nLet us denote by yi the true (unknown) label for input xi, by(cid:101)yi the noisy label given in the dataset,\nand by(cid:98)yi = f\u03b8(xi) the label predicted by the network. We will denote the (unknown) noise by\n\u03b5i = (cid:101)yi \u2212 yi and assume it is centered and i.i.d., with \ufb01nite variance \u03c3\u03b5. The training criterion\nis E(\u03b8) = (cid:80)\nj ||(cid:98)yj \u2212(cid:101)yj||2. At convergence, the training leads to a local optimum of the energy\nlandscape: \u2207\u03b8E = 0, that is,(cid:80)\nj((cid:98)yj \u2212(cid:101)yj)\u2207\u03b8(cid:98)yj = 0. 
Let\u2019s choose any sample i and multiply by\n\u03b8 (xi, xj) = \u2207\u03b8(cid:98)yi.\u2207\u03b8(cid:98)yj , we get:\n\u2207\u03b8(cid:98)yi : using kI\n(cid:88)\n((cid:98)yj \u2212(cid:101)yj) kI\n\u03b8 (xj, xi)(cid:0)(cid:80)\n\n\u03b8 (xj, xi)(cid:1)\u22121 the column-normalized kernel, and\n\n\u03b8 (xj, xi) = 0.\n\n(xj, xi) = kI\n\nj kI\n\n(xj, xi) the mean value of a in the neighborhood of i, that is, the weighted\n\nby Ek[a] = (cid:80)\n\nLet us denote by kIN\n\n\u03b8\n\nj aj kIN\n\n\u03b8\n\nj\n\nl\na\nu\nt\np\ne\nc\nr\ne\nP\n\ny\nt\ni\nr\na\nl\ni\n\nm\nS\n\ni\n\nSource | Closest neighbor patches\n\nFigure 7: Closest neighbors to the leftmost patch, using the perceptual loss (\ufb01rst row) and our\nsimilarity de\ufb01nition (second row).\n\n8\n\n0.00.20.40.60.81.0Similarity0.00.10.20.30.40.50.60.70.8FrequencyNeighbors soft: 1276.70.00.20.40.60.81.0Similarity0.00.10.20.30.40.50.60.70.8FrequencyNeighbors soft: 2486.30.00.20.40.60.81.0Similarity0.00.10.20.30.40.50.60.70.8FrequencyNeighbors soft: 394.6\fk\n\nk\n\n[y] = E\nk\n\n(cid:98)yi \u2212 E\n\nfrom the average prediction in its neighborhood.\n\naverage of the aj with weights kI\n\u03b8 (xj, xi) normalized to sum up to 1. This is actually a kernel\nregression, in the spirit of Parzen-Rosenblatt window estimators. Then the previous property can be\n\nrewritten as Ek[(cid:98)y] = Ek[(cid:101)y] . As Ek[(cid:101)y] = Ek[y] + Ek[\u03b5] , this yields:\n[\u03b5] + ((cid:98)yi \u2212 E\n[(cid:98)y])\ni.e. the difference between the predicted(cid:98)yi and the average of the true labels in the neighborhood of i\nis equal to the average of the noise in the neighborhood of i, up to the deviation of the prediction(cid:98)yi\nWe want to bound the error (cid:107)(cid:98)yi \u2212 Ek[y](cid:107) without knowing neither the true labels y nor the noise \u03b5.\n\nOne can show that Ek[\u03b5] \u221d var\u03b5(Ek[\u03b5])1/2 = \u03c3\u03b5 (cid:107)kIN\nsimilarity kernel norm (cid:107)kIN\nhood quality. 
It is 1/\nthe other extreme, this factor is 1 when all points are independent: kI\nway we extend noise2noise [11] to real datasets with non-identical inputs.\nIn our remote sensing experiment, we estimate this way a denoising factor of 0.02, consistent across\nall training rounds and inputs (\u00b110%), implying that each training round contributed equally to\ndenoising the labels. This is con\ufb01rmed by Fig. 4, which shows the error steadily decreasing, on a\n\n(\u00b7, xi)(cid:107)L2. The denoising factor is thus the\nN and 1, depending on the neighbor-\n\u03b8 (xi, xj) = 1. On\n\u03b8 (xi, xj) = 0 \u2200i (cid:54)= j. This\n\ncontrol test where true labels are known. The shift ((cid:98)yi \u2212 Ek[(cid:98)y]) on the other hand can be directly\n\nN when all N data points are identical, i.e. all satisfying kC\n\n(\u00b7, xi)(cid:107)L2, which is between 1/\n\n\u221a\n\n\u221a\n\n\u03b8\n\n\u03b8\n\nestimated given the network prediction. In our case, it is 4.4px on average, which is close to the\nobserved median error for the last round in Fig. 4. It is largely input-dependent, with variance 3.2px,\nwhich is re\ufb02ected by the spread distribution of errors in Fig. 4. This input-dependent shift thus\nprovides a hint about prediction reliability.\n\nIt is also possible to bound ((cid:98)yi \u2212 Ek[(cid:98)y]) = Ek[(cid:98)yi \u2212(cid:98)y] using only similarity information (without\npredictions(cid:98)y). Theorem 1 implies that the application: \u2207\u03b8f\u03b8(x)\n(cid:107)\u2207\u03b8f\u03b8(x)(cid:107) (cid:55)\u2192 f\u03b8(x) is well-de\ufb01ned, and it can\n(cid:13)(cid:13)(cid:13)(cid:13) =\n(cid:13)(cid:13)(cid:13)(cid:13) \u2207\u03b8f\u03b8(x)\n(cid:113)\nactually be shown to be Lipschitz with a network-dependent constant (under mild hypotheses). 
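As a side check, the two extreme values of the denoising factor $\|k^{I_N}_\theta(\cdot, x_i)\|_{L^2}$ discussed above can be verified numerically. A minimal NumPy sketch, with synthetic per-sample gradients stored as the rows of a matrix `G` (all names are illustrative):

```python
import numpy as np

def denoising_factor(G, i):
    """L2 norm of the column-normalized kernel k^{I_N}(., x_i),
    where k(x_j, x_i) = G[j] . G[i] is the inner-product kernel."""
    k = G @ G[i]            # k^I(x_j, x_i) for all j
    k_norm = k / k.sum()    # normalize the weights to sum to 1
    return np.linalg.norm(k_norm)

N, d = 100, 16

# All N points identical (cosine similarity k^C = 1 everywhere):
# the factor reaches its lower bound 1/sqrt(N), maximal noise averaging.
G_same = np.ones((N, d))
assert np.isclose(denoising_factor(G_same, 0), 1 / np.sqrt(N))

# All points independent (orthogonal gradients, k^C = 0 for i != j):
# the factor is 1, no averaging over neighbors at all.
G_orth = np.eye(N)
assert np.isclose(denoising_factor(G_orth, 0), 1.0)
```

With a trained network, the rows of `G` would be the per-sample gradients $\nabla_\theta \hat{y}_j$; real datasets fall between these two regimes, the factor shrinking as the network perceives more redundancy among inputs.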
Thus

$$\|f_\theta(x) - f_\theta(x')\| \leq C \left\| \frac{\nabla_\theta f_\theta(x)}{\|\nabla_\theta f_\theta(x)\|} - \frac{\nabla_\theta f_\theta(x')}{\|\nabla_\theta f_\theta(x')\|} \right\| = \sqrt{2}\, C \sqrt{1 - k^C_\theta(x, x')},$$

yielding $\|\hat{y}_i - \hat{y}_j\| \leq \sqrt{2}\, C \sqrt{1 - k^C_\theta(x_i, x_j)}$ and thus $\big|\mathbb{E}_k[\hat{y}_i - \hat{y}]\big| \leq \sqrt{2}\, C\, \mathbb{E}_k\big[\sqrt{1 - k^C_\theta(x_i, \cdot)}\big]$.

6 Conclusion

We defined a proper notion of input similarity as perceived by the neural network, based on the ability of the network to distinguish the inputs. This brings a new tool for analyzing trained networks, in addition to visualization tools such as grad-CAM [20]. We showed how to turn it into a density estimator, which was validated on a controlled experiment and is usable to perform fast statistics on large datasets. It opens the door to underfit/overfit/uncertainty analyses, or even to control during training, as it is differentiable and computable at low cost.
In the supplementary materials, we go further in that direction and show that, if two or more samples are known to be similar (from a human point of view), it is possible to encourage the network, during training, to evolve in order to consider these samples as similar. We notice an associated dataset-dependent boosting effect that should be further studied, along with robustness to adversarial attacks, as such training differs significantly from usual methods.
Finally, we extended noise2noise [11] to the case of non-identical inputs, thus expressing self-denoising effects as a function of input similarities.
The code is available at https://github.com/Lydorn/netsimilarity .

Acknowledgments

We thank Victor Berger and Adrien Bousseau for useful discussions.
This work benefited from the support of the project EPITOME ANR-17-CE23-0009 of the French National Research Agency (ANR).

References

[1] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.

[2] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990-2999, 2016.

[3] Harris Drucker and Yann Le Cun. Double backpropagation increasing generalization performance. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 145-150. IEEE, 1991.

[4] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262-270, 2015.

[5] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

[6] Nicolas Girard, Guillaume Charpiat, and Yuliya Tarabalka. Noisy supervision for correcting misaligned cadaster maps without perfect ground truth data. In IGARSS, July 2019. URL https://hal.inria.fr/hal-02065211.

[7] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767-5777, 2017.

[8] Sepp Hochreiter and Jürgen Schmidhuber. Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems, pages 529-536, 1995.

[9] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571-8580, 2018.

[10] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution.
In European Conference on Computer Vision, pages 694-711. Springer, 2016.

[11] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. In International Conference on Machine Learning, pages 2971-2980, 2018.

[12] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1910-1918, 2017.

[13] Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, and Pierre Alliez. Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. In IGARSS, 2017.

[14] Volodymyr Mnih and Geoffrey E Hinton. Learning to label aerial images from noisy data. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 567-574, 2012.

[15] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196-1204, 2013.

[16] Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 4(2):108-153, 2015. doi: 10.1093/imaiai/iav006. URL https://doi.org/10.1093/imaiai/iav006.

[17] Yann Ollivier. Riemannian metrics for neural networks II: recurrent networks and learning symbolic data sequences. Information and Inference: A Journal of the IMA, 4(2):154-193, 2015. doi: 10.1093/imaiai/iav007. URL https://doi.org/10.1093/imaiai/iav007.

[18] OpenStreetMap contributors. Planet dump retrieved from https://planet.osm.org, 2017.

[19] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction.
In Proceedings of the 28th International Conference on Machine Learning, pages 833-840. Omnipress, 2011.

[20] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618-626, 2017.

[21] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.