{"title": "Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 9584, "page_last": 9593, "abstract": "It is widely believed that learning good representations is one of the main reasons for the success of deep neural networks. Although highly intuitive, there is a lack of theory and systematic approach quantitatively characterizing what representations do deep neural networks learn. In this work, we move a tiny step towards a theory and better understanding of the representations. Specifically, we study a simpler problem: How similar are the representations learned by two networks with identical architecture but trained from different initializations. We develop a rigorous theory based on the neuron activation subspace match model. The theory gives a complete characterization of the structure of neuron activation subspace matches, where the core concepts are maximum match and simple match which describe the overall and the finest similarity between sets of neurons in two networks respectively. We also propose efficient algorithms to find the maximum match and simple matches. Finally, we conduct extensive experiments using our algorithms. 
Experimental results suggest that, surprisingly, representations learned by the same convolutional layers of networks trained from different initializations are not as similar as prevalently expected, at least in terms of subspace match.", "full_text": "Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation\n\nLiwei Wang1,2  Lunjia Hu3  Jiayuan Gu1  Yue Wu1  Zhiqiang Hu1  Kun He4  John Hopcroft5\n\n1Key Laboratory of Machine Perception, MOE, School of EECS, Peking University\n2Center for Data Science, Peking University, Beijing Institute of Big Data Research\n3Computer Science Department, Stanford University\n4Huazhong University of Science and Technology\n5Cornell University\n\nwanglw@cis.pku.edu.cn  lunjia@stanford.edu  {gujiayuan, frankwu, huzq}@pku.edu.cn  brooklet60@hust.edu.cn, jeh17@cornell.edu\n\nAbstract\n\nIt is widely believed that learning good representations is one of the main reasons for the success of deep neural networks. Although highly intuitive, there is a lack of theory and systematic approach quantitatively characterizing what representations deep neural networks learn. In this work, we move a tiny step towards a theory and better understanding of the representations. Specifically, we study a simpler problem: How similar are the representations learned by two networks with identical architecture but trained from different initializations? We develop a rigorous theory based on the neuron activation subspace match model. The theory gives a complete characterization of the structure of neuron activation subspace matches, where the core concepts are the maximum match and simple matches, which describe the overall and the finest similarity between sets of neurons in two networks respectively. We also propose efficient algorithms to find the maximum match and simple matches. Finally, we conduct extensive experiments using our algorithms. Experimental results suggest that, surprisingly, representations learned by the same convolutional layers of networks trained from different initializations are not as similar as prevalently expected, at least in terms of subspace match.\n\n1 Introduction\n\nIt is widely believed that learning good representations is one of the main reasons for the success of deep neural networks [Krizhevsky et al., 2012, He et al., 2016]. Taking CNN as an example, filters, shared weights, pooling and the composition of layers are all designed to learn good representations of images. Although highly intuitive, it is still elusive what representations deep neural networks learn.\nIn this work, we move a tiny step towards a theory and a systematic approach that characterize the representations learned by deep nets. In particular, we consider a simpler problem: How similar are the representations learned by two networks with identical architecture but trained from different initializations? It is observed that training the same neural network from different random initializations frequently yields similar performance [Dauphin et al., 2014]. A natural question arises: do the differently-initialized networks learn similar representations as well, or do they learn totally\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f
Experimental results suggest that, surprisingly, representations learned\nby the same convolutional layers of networks trained from different initializations\nare not as similar as prevalently expected, at least in terms of subspace match.\n\n1\n\nIntroduction\n\nIt is widely believed that learning good representations is one of the main reasons for the success of\ndeep neural networks [Krizhevsky et al., 2012, He et al., 2016]. Taking CNN as an example, \ufb01lters,\nshared weights, pooling and composition of layers are all designed to learn good representations of\nimages. Although highly intuitive, it is still illusive what representations do deep neural networks\nlearn.\nIn this work, we move a tiny step towards a theory and a systematic approach that characterize\nthe representations learned by deep nets.\nIn particular, we consider a simpler problem: How\nsimilar are the representations learned by two networks with identical architecture but trained from\ndifferent initializations. It is observed that training the same neural network from different random\ninitializations frequently yields similar performance [Dauphin et al., 2014]. A natural question arises:\ndo the differently-initialized networks learn similar representations as well, or do they learn totally\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fdistinct representations, for example describing the same object from different views? Moreover,\nwhat is the granularity of similarity: do the representations exhibit similarity in a local manner,\ni.e. a single neuron is similar to a single neuron in another network, or in a distributed manner, i.e.\nneurons aggregate to clusters that collectively exhibit similarity? The questions are central to the\nunderstanding of the representations learned by deep neural networks, and may shed light on the\nlong-standing debate about whether network representations are local or distributed.\nLi et al. 
[2016] studied these questions from an empirical perspective. Their approach breaks down the concept of similarity into one-to-one mappings, one-to-many mappings and many-to-many mappings, and probes each kind of mapping with ad-hoc techniques. Specifically, they applied linear correlation and mutual information analysis to study one-to-one mappings, and found that some core representations are shared by differently-initialized networks, but some rare ones are not; they applied a sparse weighted LASSO model to study one-to-many mappings and found that the whole correspondence can be decoupled into a series of correspondences between smaller neuron clusters; and finally they applied a spectral clustering algorithm to find many-to-many mappings.\nAlthough Li et al. [2016] provide interesting insights, their approach is somewhat heuristic, especially for one-to-many mappings and many-to-many mappings. We argue that a systematic investigation may deliver a much more thorough comprehension. To this end, we develop a rigorous theory to study these questions. We begin by modeling the similarity between neurons as the matches of subspaces spanned by activation vectors of neurons. The activation vector [Raghu et al., 2017] records the neuron\u2019s responses over a finite set of inputs, acting as the representation of a single neuron.1 Compared with other possible representations such as the weight vector, the activation vector characterizes the essence of the neuron as an input-output function, and takes into consideration the input distribution. Further, the representation of a neuron cluster is represented by the subspace spanned by the activation vectors of the neurons in the cluster. 
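This subspace view can be made concrete with a small numerical sketch. The code below is our own illustration (the data, seed, and helper name same_subspace are hypothetical, not from the paper): two clusters of neurons whose activation matrices are related by an invertible linear mix span the same subspace, which a rank test detects.

```python
import numpy as np

# Hypothetical data: activation vectors over d inputs, one column per neuron.
rng = np.random.default_rng(0)
d = 6
Z_X = rng.standard_normal((d, 3))        # cluster X: 3 neurons
Z_Y = Z_X @ rng.standard_normal((3, 3))  # cluster Y: an invertible linear mix of X

def same_subspace(A, B, tol=1e-8):
    # Columns of A and B span the same subspace iff stacking the two matrices
    # does not increase the rank beyond that of either one.
    r = np.linalg.matrix_rank
    return r(A, tol) == r(B, tol) == r(np.hstack([A, B]), tol)

print(same_subspace(Z_X, Z_Y))  # True: the two clusters represent the same subspace
```

Under this model the two clusters carry the same representation even though no individual neuron of one cluster need resemble any individual neuron of the other.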
The subspace representations derive from the fact that activations of neurons are followed by affine transformations; two neuron clusters whose activations differ up to an affine transformation are essentially learning the same representations.\nIn order to develop a thorough understanding of the similarity between clusters of neurons, we give a complete characterization of the structure of the neuron activation subspace matches. We show the unique existence of the maximum match, and we prove the Decomposition Theorem: every match can be decomposed as the union of a set of simple matches, where simple matches are those which cannot be decomposed any further. The maximum match characterizes the whole similarity, while simple matches represent minimal units of similarity, collectively giving a complete characterization. Furthermore, we investigate how to characterize these simple matches so that we can develop efficient algorithms for finding them.\nFinally, we conduct extensive experiments using our algorithms. We analyze the size of the maximum match and the distribution of the sizes of simple matches. It turns out that, contrary to what is prevalently expected, representations learned by almost all convolutional layers exhibit very low similarity in terms of matches. We argue that this observation reflects that the current understanding of learning representations is limited.\nOur contributions are summarized as follows.\n\n1. We develop a theory based on the neuron activation subspace match model to study the similarity between representations learned by two networks with identical architecture but trained from different initializations. We give a complete analysis of the structure of matches.\n\n2. We propose efficient algorithms for finding the maximum match and the simple matches, which are the central concepts in our theory.\n\n3. 
Experimental results demonstrate that representations learned by most convolutional layers exhibit low similarity in terms of subspace match.\n\nThe rest of the paper is organized as follows. In Section 2 we formally describe the neuron activation subspace match model. Section 3 presents our theory of neuron activation subspace match. Based on the theory, we propose algorithms in Section 4. In Section 5 we show experimental results and analysis. Finally, Section 6 concludes. Due to the limited space, all proofs are given in the supplementary material.\n\n1Li et al. [2016] also implicitly used the activation vector as the neuron\u2019s representation.\n\n2\n\n\f2 Preliminaries\n\nIn this section, we formally describe the neuron activation subspace match model that will be analyzed throughout this paper. Let X and Y be the sets of neurons in the same layer2 of two networks with identical architecture but trained from different initializations. Suppose the networks are given d input data a1, a2, \u00b7\u00b7\u00b7, ad. For every v \u2208 X \u222a Y, let the output of neuron v over ai be zv(ai). The representation of a neuron v is measured by the activation vector [Raghu et al., 2017] of the neuron v over the d inputs, zv := (zv(a1), zv(a2), \u00b7\u00b7\u00b7, zv(ad)). For any subset X \u2286 X , we denote the vector set {zx : x \u2208 X} by zX for short. The representation of a subset of neurons X \u2286 X is measured by the subspace spanned by the activation vectors of the neurons therein, span(zX) := {\u2211_{zx \u2208 zX} \u03bbzx zx : \u03bbzx \u2208 R}. Similarly for Y \u2286 Y.
In particular, the representation of an empty subset is span(\u2205) := {0}, where 0 is the zero vector in Rd.\nThe reason why we adopt the neuron activation subspace as the representation of a subset of neurons is that activations of neurons are followed by affine transformations. For any neuron \u02dcx in the layer following X, we have z\u02dcx(ai) = ReLU(\u2211_{x \u2208 X} wx zx(ai) + b), where {wx : x \u2208 X} and b are the parameters. Similarly for a neuron \u02dcy in the layer following Y. If span(zX) = span(zY), then for any {wx : x \u2208 X} there exists {wy : y \u2208 Y} such that \u2200ai, \u2211_{x \u2208 X} wx zx(ai) = \u2211_{y \u2208 Y} wy zy(ai), and vice versa. Essentially, \u02dcx and \u02dcy receive the same information from either X or Y.\nWe now give the formal definition of a match.\nDefinition 1 (\u03b5-approximate match and exact match). Let X \u2286 X and Y \u2286 Y be two subsets of neurons. \u2200\u03b5 \u2208 [0, 1), we say (X, Y ) forms an \u03b5-approximate match in (X ,Y), if\n\n1. \u2200x \u2208 X, dist(zx, span(zY)) \u2264 \u03b5|zx|,\n2. \u2200y \u2208 Y, dist(zy, span(zX)) \u2264 \u03b5|zy|.\n\nHere we use the L2 distance: for any vector z and any subspace S, dist(z, S) = min_{z\u2032 \u2208 S} \u2016z \u2212 z\u2032\u20162.\nWe call a 0-approximate match an exact match. Equivalently, (X, Y ) is an exact match if span(zX) = span(zY).\n\n3 A Theory of Neuron Activation Subspace Match\n\nIn this section, we develop a theory which gives a complete characterization of the neuron activation subspace match problem. For two sets of neurons X , Y in two networks, we show the structure of all the matches (X, Y ) in (X ,Y). It turns out that every match (X, Y ) can be decomposed as a union of simple matches, where a simple match is an atomic match that cannot be decomposed any further.\nSimple match is the most important concept in our theory.
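Definition 1 can be checked numerically via least squares. The sketch below is our own illustration (the helper names dist_to_span and is_match are ours, and activation vectors are assumed to be stored as the columns of a matrix):

```python
import numpy as np

def dist_to_span(z, Z):
    # L2 distance from vector z to the span of the columns of Z; the span of an
    # empty set of columns is {0}, so the distance is then just the norm of z.
    if Z.shape[1] == 0:
        return float(np.linalg.norm(z))
    coef, *_ = np.linalg.lstsq(Z, z, rcond=None)
    return float(np.linalg.norm(z - Z @ coef))

def is_match(Z_X, Z_Y, eps):
    # Definition 1: every activation vector on each side must lie within
    # eps times its own norm of the subspace spanned by the other side.
    ok_x = all(dist_to_span(Z_X[:, i], Z_Y) <= eps * np.linalg.norm(Z_X[:, i])
               for i in range(Z_X.shape[1]))
    ok_y = all(dist_to_span(Z_Y[:, j], Z_X) <= eps * np.linalg.norm(Z_Y[:, j])
               for j in range(Z_Y.shape[1]))
    return ok_x and ok_y
```

In floating point, an exact match (eps = 0) should be tested with a small positive tolerance rather than literally zero, since least-squares residuals of vectors that lie in the span are tiny but nonzero.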
If there are many one-to-one simple matches (i.e., |X| = |Y| = 1), it implies that the two networks learn very similar representations at the neuron level. On the other hand, if all the simple matches have very large size (i.e., |X| and |Y| are both large), it is reasonable to say that the two networks learn different representations, at least in the details.\nWe will give a mathematical characterization of the simple matches. This allows us to design efficient algorithms for finding the simple matches (Sec. 4). The structures of exact and approximate match are somewhat different. In Section 3.1, we present the simpler case of exact match, and in Section 3.2, we describe the more general \u03b5-approximate match. When not explicitly stated otherwise, by match we mean \u03b5-approximate match.\nWe begin with a lemma stating that matches are closed under union.\nLemma 2 (Union-Close Lemma). Let (X1, Y1) and (X2, Y2) be two \u03b5-approximate matches in (X ,Y). Then (X1 \u222a X2, Y1 \u222a Y2) is still an \u03b5-approximate match.\nThe fact that matches are closed under union implies that there exists a unique maximum match.\nDefinition 3 (Maximum Match). A match (X\u2217, Y \u2217) in (X ,Y) is the maximum match if every match (X, Y ) in (X ,Y) satisfies X \u2286 X\u2217 and Y \u2286 Y \u2217.\n\n2In this paper we focus on neurons of the same layer. But the method applies to an arbitrary set of neurons.\n\n3\n\n\fThe maximum match is simply the union of all matches. In Section 4 we will develop an efficient algorithm that finds the maximum match.\nNow we are ready to give a complete characterization of all the matches. First, we point out that there can be exponentially many matches. Fortunately, every match can be represented as the union of some simple matches defined below. 
The number of simple matches is polynomial in the setting of exact match given that (zx)x\u2208X and (zy)y\u2208Y are both linearly independent, and under certain conditions for approximate match as well.\nDefinition 4 (Simple Match). A match (\u02c6X, \u02c6Y ) in (X ,Y) is a simple match if \u02c6X \u222a \u02c6Y is non-empty and there exist no matches (Xi, Yi) in (X ,Y) such that\n\n1. \u2200i, (Xi \u222a Yi) \u228a (\u02c6X \u222a \u02c6Y );\n2. \u02c6X = \u22c3i Xi, \u02c6Y = \u22c3i Yi.\n\nWith the concept of simple matches, we show the Decomposition Theorem: every match can be decomposed as the union of a set of simple matches. Consequently, simple matches fully characterize the structure of matches.\nTheorem 5 (Decomposition Theorem). Every match (X, Y ) in (X ,Y) can be expressed as a union of simple matches. Formally, there are simple matches (\u02c6Xi, \u02c6Yi) satisfying X = \u22c3i \u02c6Xi and Y = \u22c3i \u02c6Yi.\n\n3.1 Structure of Exact Matches\n\nThe main goal of this and the next subsection is to understand the simple matches. The definition of a simple match only tells us that it cannot be decomposed. But how do we find the simple matches? How many simple matches exist? We answer these questions by giving a characterization of the simple match. Here we consider the setting of exact match, which has a much simpler structure than approximate match.\nAn important property of exact match is that matches are closed under intersection.\nLemma 6 (Intersection-Close Lemma). Assume (zx)x\u2208X and (zy)y\u2208Y are both linearly independent. Let (X1, Y1) and (X2, Y2) be exact matches in (X ,Y). Then, (X1 \u2229 X2, Y1 \u2229 Y2) is still an exact match.\n\nIt turns out that in the setting of exact match, simple matches can be explicitly characterized by the v-minimum match defined below.\nDefinition 7 (v-Minimum Match). 
Given a neuron v \u2208 X \u222a Y, we define the v-minimum match to be the exact match (Xv, Yv) in (X ,Y) satisfying the following properties:\n\n1. v \u2208 Xv \u222a Yv;\n2. any exact match (X, Y ) in (X ,Y) with v \u2208 X \u222a Y satisfies Xv \u2286 X and Yv \u2286 Y .\n\nEvery neuron v in the maximum match (X\u2217, Y \u2217) has a unique v-minimum match, which is the intersection of all matches that contain v. For a neuron v not in the maximum match, there is no v-minimum match because there is no match containing v.\nThe following theorem states that the simple matches are exactly the v-minimum matches.\nTheorem 8. Assume (zx)x\u2208X and (zy)y\u2208Y are both linearly independent. Let (X\u2217, Y \u2217) be the maximum (exact) match in (X ,Y). \u2200v \u2208 X\u2217 \u222a Y \u2217, the v-minimum match is a simple match, and every simple match is a v-minimum match for some neuron v \u2208 X\u2217 \u222a Y \u2217.\nTheorem 8 implies that the number of simple exact matches is at most linear with respect to the number of neurons given the activation vectors being linearly independent, because the v-minimum match for each neuron v is unique. We will give a polynomial time algorithm in Section 4 to find all the v-minimum matches.\n\n3.2 Structure of Approximate Matches\n\nThe structure of \u03b5-approximate match is more complicated than exact match. A major difference is that in the setting of approximate matches, the intersection of two matches is not necessarily a match. As a consequence, there is no v-minimum match in general. Instead, we have the v-minimal match.\n\n4\n\n\fDefinition 9 (v-Minimal Match). v-minimal matches are matches (Xv, Yv) in (X ,Y) with the following properties:\n1. v \u2208 Xv \u222a Yv;\n2. 
if a match (X, Y ) with X \u2286 Xv and Y \u2286 Yv satisfies v \u2208 X \u222a Y , then (X, Y ) = (Xv, Yv).\nDifferent from the setting of exact match, where the v-minimum match is unique for a neuron v, there may be multiple v-minimal matches for v in the setting of approximate match, and in this setting simple matches can be characterized by v-minimal matches instead. Again, for any neuron v not in the maximum match (X\u2217, Y \u2217), there is no v-minimal match because no match contains v.\nTheorem 10. Let (X\u2217, Y \u2217) be the maximum match in (X ,Y). \u2200v \u2208 X\u2217 \u222a Y \u2217, every v-minimal match is a simple match, and every simple match is a v-minimal match for some v \u2208 X\u2217 \u222a Y \u2217.\nRemark 1. We use the notion of v-minimal match for v \u2208 X \u222a Y. That is, the neuron can be in either network. We emphasize that this is necessary. Restricting v \u2208 X (or v \u2208 Y) does not yield Theorem 10 anymore. In other words, v-minimal matches for v \u2208 X do not represent all simple matches. See Remark A.1 in the Supplementary Material for details.\nRemark 2. One may have the impression that the structure of matches is very simple. This is not exactly the case. Here we point out the complicated aspects:\n\n1. Matches are not closed under the difference operation, even for exact matches. More generally, let (X1, Y1) and (X2, Y2) be two matches with X1 \u228a X2, Y1 \u228a Y2. (X2\\X1, Y2\\Y1) is not necessarily a match.\n2. The decomposition of a match into the union of simple matches is not necessarily unique.\n\nSee Section C in the Supplementary Material for details.\n\n4 Algorithms\n\nIn this section, we will give an efficient algorithm that finds the maximum match. Based on this algorithm, we further give an algorithm that finds all the simple matches, which are precisely the v-minimum/minimal matches as shown in the previous section. 
The algorithm for finding the maximum match is given in Algorithm 1. Initially, we guess the maximum match (X\u2217, Y \u2217) to be X\u2217 = X , Y \u2217 = Y. If there is x \u2208 X\u2217 such that dist(zx, span(zY \u2217)) > \u03b5, then we remove x from X\u2217. Similarly, if some y \u2208 Y \u2217 cannot be linearly expressed by zX\u2217 within error \u03b5, then we remove y from Y \u2217. X\u2217 and Y \u2217 are repeatedly updated in this way until no such x, y can be found.\n\nAlgorithm 1 max_match((zv\u2032)v\u2032\u2208X\u222aY, \u03b5)\n1: (X\u2217, Y \u2217) \u2190 (X , Y)\n2: changed \u2190 true\n3: while changed do\n4:     changed \u2190 false\n5:     for x \u2208 X\u2217 do\n6:         if dist(zx, span(zY \u2217)) > \u03b5 then\n7:             X\u2217 \u2190 X\u2217\\{x}\n8:             changed \u2190 true\n9:     if changed then\n10:         changed \u2190 false\n11:     for y \u2208 Y \u2217 do\n12:         if dist(zy, span(zX\u2217)) > \u03b5 then\n13:             Y \u2217 \u2190 Y \u2217\\{y}\n14:             changed \u2190 true\n15: return (X\u2217, Y \u2217)\n\nTheorem 11. Algorithm 1 outputs the maximum match and runs in polynomial time.\nOur next algorithm (Algorithm 2) is to output, for a given neuron v \u2208 X \u222a Y, the v-minimum match (for exact match, given the activation vectors being linearly independent) or one v-minimal match (for approximate match). 
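Algorithm 1 admits a short numpy sketch. This is our own illustration, not the authors' implementation: neurons are represented by column indices of an activation matrix, and we use the relative threshold of eps times the vector norm from Definition 1.

```python
import numpy as np

def dist_to_span(z, Z):
    # distance from z to the column span of Z; an empty set of columns spans {0}
    if Z.shape[1] == 0:
        return float(np.linalg.norm(z))
    coef, *_ = np.linalg.lstsq(Z, z, rcond=None)
    return float(np.linalg.norm(z - Z @ coef))

def max_match(Z_X, Z_Y, eps):
    # Algorithm 1 sketch: start from all neurons and repeatedly drop any neuron
    # whose activation vector lies too far from the span of the other side.
    X, Y = list(range(Z_X.shape[1])), list(range(Z_Y.shape[1]))
    changed = True
    while changed:
        changed = False
        for x in list(X):
            zx = Z_X[:, x]
            if dist_to_span(zx, Z_Y[:, Y]) > eps * np.linalg.norm(zx):
                X.remove(x)
                changed = True
        for y in list(Y):
            zy = Z_Y[:, y]
            if dist_to_span(zy, Z_X[:, X]) > eps * np.linalg.norm(zy):
                Y.remove(y)
                changed = True
    return X, Y
```

For example, on two hypothetical layers that share a common two-dimensional activation subspace plus one unrelated neuron each, this sketch keeps exactly the shared part and discards the unrelated neurons.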
The algorithm starts from (Xv, Yv) being the maximum match and iteratively finds a smaller match (Xv, Yv) keeping v \u2208 Xv \u222a Yv, until further reducing the size of (Xv, Yv) would have to violate v \u2208 Xv \u222a Yv.\n\nAlgorithm 2 min_match((zv\u2032)v\u2032\u2208X\u222aY, v, \u03b5)\n1: (Xv, Yv) \u2190 max_match((zv\u2032)v\u2032\u2208X\u222aY, \u03b5)\n2: if v /\u2208 Xv \u222a Yv then\n3:     return \u201cfailure\u201d\n4: while there exists u \u2208 Xv \u222a Yv unchecked do\n5:     Pick an unchecked u \u2208 Xv \u222a Yv and mark it as checked\n6:     if u \u2208 Xv then\n7:         (X, Y ) \u2190 (Xv\\{u}, Yv)\n8:     else\n9:         (X, Y ) \u2190 (Xv, Yv\\{u})\n10:     (X\u2217, Y \u2217) \u2190 max_match((zv\u2032)v\u2032\u2208X\u222aY , \u03b5)\n11:     if v \u2208 X\u2217 \u222a Y \u2217 then\n12:         (Xv, Yv) \u2190 (X\u2217, Y \u2217)\n13: return (Xv, Yv)\n\nTheorem 12. Algorithm 2 outputs one v-minimal match for the given neuron v. If \u03b5 = 0 (exact match), the algorithm outputs the unique v-minimum match provided (zx)x\u2208X and (zy)y\u2208Y are both linearly independent. Moreover, the algorithm always runs in polynomial time.\n\nFinally, we show an algorithm (Algorithm 3) that finds all the v-minimal matches in time L^O(Nv). Here, L is the size of the input (L = (|X| + |Y|) \u00b7 d) and Nv is the number of v-minimal matches for neuron v. Note that in the setting of \u03b5 = 0 (exact match) with (zx)x\u2208X and (zy)y\u2208Y being both linearly independent, we have Nv \u2264 1, so Algorithm 3 runs in polynomial time in this case.\nAlgorithm 3 finds all the v-minimal matches one by one by calling Algorithm 2 in each iteration. 
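Algorithm 2 can likewise be sketched in Python. The code below is self-contained and purely illustrative (our conventions, not the authors' implementation: neurons are column indices, and v_side selects which network contains v); it repeatedly removes a candidate neuron, recomputes the maximum match on what remains, and keeps the shrunken match only if it still contains v.

```python
import numpy as np

def dist_to_span(z, Z):
    # L2 distance from z to the column span of Z (an empty span is {0})
    if Z.shape[1] == 0:
        return float(np.linalg.norm(z))
    coef, *_ = np.linalg.lstsq(Z, z, rcond=None)
    return float(np.linalg.norm(z - Z @ coef))

def max_match(Z_X, Z_Y, eps):
    # Algorithm 1 sketch: drop neurons lying too far from the other side's span.
    X, Y = list(range(Z_X.shape[1])), list(range(Z_Y.shape[1]))
    changed = True
    while changed:
        changed = False
        for Z_own, Z_other, own, other in ((Z_X, Z_Y, X, Y), (Z_Y, Z_X, Y, X)):
            for u in list(own):
                zu = Z_own[:, u]
                if dist_to_span(zu, Z_other[:, other]) > eps * np.linalg.norm(zu):
                    own.remove(u)
                    changed = True
    return X, Y

def min_match(Z_X, Z_Y, v_side, v, eps):
    # Algorithm 2 sketch: v_side is 'X' or 'Y', v a column index on that side.
    Xv, Yv = max_match(Z_X, Z_Y, eps)
    def contains_v(X, Y):
        return v in (X if v_side == 'X' else Y)
    if not contains_v(Xv, Yv):
        return None  # the paper's 'failure' case: no match contains v
    checked = set()
    while True:
        todo = [(s, u) for s, side in (('X', Xv), ('Y', Yv)) for u in side
                if (s, u) not in checked and (s, u) != (v_side, v)]
        if not todo:
            return Xv, Yv
        s, u = todo[0]
        checked.add((s, u))
        X = [x for x in Xv if (s, x) != ('X', u)]  # drop u if it is an X-neuron
        Y = [y for y in Yv if (s, y) != ('Y', u)]  # drop u if it is a Y-neuron
        Xs, Ys = max_match(Z_X[:, X], Z_Y[:, Y], eps)
        Xs, Ys = [X[i] for i in Xs], [Y[j] for j in Ys]
        if contains_v(Xs, Ys):  # keep the smaller match only if it contains v
            Xv, Yv = Xs, Ys
```

On a toy pair of layers where one X-neuron is a rescaling of one Y-neuron, the sketch shrinks the maximum match down to that one-to-one pair when started from the rescaled neuron.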
To make sure that we never find the same v-minimal match twice, we always delete a neuron in every previously-found v-minimal match before we start to find the next one.\n\nAlgorithm 3 all_min_match((zv\u2032)v\u2032\u2208X\u222aY, v, \u03b5)\n1: S \u2190 \u2205\n2: found \u2190 true\n3: while found do\n4:     found \u2190 false\n5:     Let S = {(X1, Y1), (X2, Y2), \u00b7\u00b7\u00b7, (X|S|, Y|S|)}\n6:     while \u00acfound and there exists (u1, u2, \u00b7\u00b7\u00b7, u|S|) \u2208 (X1\u222aY1)\u00d7(X2\u222aY2)\u00d7\u00b7\u00b7\u00b7\u00d7(X|S|\u222aY|S|) unchecked do\n7:         Pick the next unchecked (u1, u2, \u00b7\u00b7\u00b7, u|S|) \u2208 (X1\u222aY1)\u00d7(X2\u222aY2)\u00d7\u00b7\u00b7\u00b7\u00d7(X|S|\u222aY|S|) and mark it as checked\n8:         (X, Y ) \u2190 (X , Y)\n9:         for i = 1, 2, \u00b7\u00b7\u00b7, |S| do\n10:             if ui \u2208 X then\n11:                 X \u2190 X\\{ui}\n12:             else\n13:                 Y \u2190 Y \\{ui}\n14:         if min_match((zv\u2032)v\u2032\u2208X\u222aY, v, \u03b5) doesn\u2019t return \u201cfailure\u201d then\n15:             (Xv, Yv) \u2190 min_match((zv\u2032)v\u2032\u2208X\u222aY, v, \u03b5)\n16:             S \u2190 S \u222a {(Xv, Yv)}\n17:             found \u2190 true\n18: return S\n\nTheorem 13. Algorithm 3 outputs all the Nv different v-minimal matches in time L^O(Nv). With Algorithm 3, we can find all the simple matches by exploring all v \u2208 X \u222a Y based on Theorem 10.\n\n6\n\n\fIn the worst case, Algorithm 3 is not polynomial time, as Nv is not upper bounded by a constant in general. However, under assumptions we call strong linear independence and stability, we show that Algorithm 3 runs in polynomial time. Specifically, we say (zx)x\u2208X satisfies \u03b8-strong linear independence for \u03b8 \u2208 (0, \u03c0/2] if 0 /\u2208 zX and for any two non-empty disjoint subsets X1, X2 \u2286 X , the angle between span(zX1) and span(zX2) is at least \u03b8. 
Here, the angle between two subspaces is defined to be the minimum angle between non-zero vectors in the two subspaces. We define \u03b8-strong linear independence for (zy)y\u2208Y similarly. We say (zx)x\u2208X and (zy)y\u2208Y satisfy (\u03b5, \u03bb)-stability for \u03b5 \u2265 0 and \u03bb > 1 if \u2200x \u2208 X , \u2200Y \u2286 Y, dist(zx, span(zY)) /\u2208 (\u03b5|zx|, \u03bb\u03b5|zx|] and \u2200y \u2208 Y, \u2200X \u2286 X , dist(zy, span(zX)) /\u2208 (\u03b5|zy|, \u03bb\u03b5|zy|]. We prove the following theorem.\nTheorem 14. Suppose \u2203\u03b8 \u2208 (0, \u03c0/2] such that (zx)x\u2208X and (zy)y\u2208Y both satisfy \u03b8-strong linear independence and (\u03b5, 2/sin \u03b8 + 1)-stability. Then, \u2200v \u2208 X \u222a Y, Nv \u2264 1. As a consequence, Algorithm 3 finds all the v-minimal matches in polynomial time, and we can find all the simple matches in polynomial time by exploring all v \u2208 X \u222a Y based on Theorem 10.\n\n5 Experiments\n\nWe conduct experiments with the VGG [Simonyan and Zisserman, 2014] and ResNet [He et al., 2016] architectures on the datasets CIFAR10 [Krizhevsky et al.] and ImageNet [Deng et al., 2009]. Here we investigate multiple networks initialized with different random seeds, which achieve reasonable accuracies. Unless otherwise noted, we focus on the neurons activated by ReLU.\nThe activation vector zv mentioned in Section 2 is defined as the activations of one neuron v over the validation set. For a fully connected layer, zv \u2208 Rd, where d is the number of images. For a convolutional layer, the activations of one neuron v, given the image ai, form a feature map zv(ai) \u2208 Rh\u00d7w. 
We vectorize the feature map as vec(zv(ai)) \u2208 Rhw, and thus zv := (vec(zv(a1)), vec(zv(a2)), \u00b7\u00b7\u00b7, vec(zv(ad))) \u2208 Rhwd.\n\n5.1 Maximum Match\n\nWe introduce the maximum matching similarity to measure the overall similarity between sets of neurons. Given two sets of neurons X , Y and \u03b5, Algorithm 1 outputs the maximum match (X\u2217, Y \u2217). The maximum matching similarity s under \u03b5 is defined as s(\u03b5) = (|X\u2217| + |Y \u2217|) / (|X| + |Y|).\nHere we only study neurons in the same layer of two networks with the same architecture but initialized with different seeds. For a convolutional layer, we randomly sample d of the h \u00d7 w \u00d7 d outputs to form the activation vectors, repeat this several times, and average the maximum matching similarity.\nDifferent Architectures and Datasets We examine several architectures on different datasets. For each experiment, five differently initialized networks are trained, and the maximum matching similarity is averaged over all pairs of networks given \u03b5. The similarity values show little variance among different pairs, which indicates that this metric reveals a general property of network pairs. The details of the network structures and validation accuracies are listed in Supplementary Section E.2.\nFigure 1 shows the maximum matching similarities of all the layers of different architectures under various \u03b5. From these results, we make the following conclusions:\n\n1. For most of the convolutional layers, the maximum matching similarity is very low. For deep neural networks, the similarity is almost zero. This is surprising, as it is widely believed that the convolutional layers are trained to extract specific patterns. However, the observation shows that different CNNs (with the same architecture) may learn different intermediate patterns.\n\n2. 
Although layers close to the output sometimes exhibit high similarity, this is a simple consequence of their alignment to the output: First, the output vectors of the two networks must be well aligned because both networks achieve high accuracy. Second, the layers before the output must then be similar, because if they were not, the output vectors would not be similar after a linear transformation. Note that in Fig. 1 (b) the layers close to the output do not exhibit high similarity; this is because the accuracy in this experiment is relatively low. (See also the Supplementary Material: for a trained and an untrained network, which have very different accuracies, the layers close to the output do not show much similarity.)\n\n7\n\n\f(a) CIFAR10-ResNet18\n\n(b) ImageNet-VGG16\n\n(c) CIFAR10-ResNet34\n\nFigure 1: Maximum matching similarities of different architectures on different datasets under various \u03b5. The x-axis is along the direction of propagation. (a) shows ResNet18 on the CIFAR10 validation set; we leave other classical architectures like VGG to the Supplementary Material; (b) shows VGG16 on the ImageNet validation set; (c) shows a deeper ResNet on CIFAR10.\n\n3. There is also relatively high similarity between layers close to the input. Again, this is a consequence of their alignment to the same input data as well as the low-dimensional nature of the low-level layers. More concretely, the fact that each low-level filter contains only a few parameters results in a low-dimensional space after the transformation, and it is much easier to have high similarity in a low-dimensional space than in a high-dimensional one.\n\n5.2 Simple Match\n\nThe maximum matching illustrates the overall similarity but does not provide information about the relation of specific neurons. Here we analyze the distribution of the sizes of simple matches to reveal the finer structure of a layer. 
Given \u03b5 and two sets of neurons X and Y, Algorithm 3 outputs all the simple matches.\nFor a more efficient implementation, given \u03b5, we run the randomized Algorithm 2 over each v \u2208 X \u222a Y for several iterations to get one v-minimal match per run. The final result is the collection of all the v-minimal matches found (with duplicates removed), which we use to estimate the distribution.\nFigure 2 shows the distribution of the sizes of simple matches for layers close to the input and output respectively. We make the following observations:\n\n1. While the layers close to the output are similar overall, they do not seem to show similarity in a local manner: there are very few simple matches of small size. This is further evidence that such similarity is the result of alignment to the output, rather than of intrinsically similar representations.\n\n2. The layer close to the input shows lower similarity in the finer structure. Again, there are few simple matches of small size.\n\nIn sum, almost no single neuron (or small set of neurons) learns similar representations, even in layers close to the input or output.\n\n8\n\n\f(a) Layer close to input\n\n(b) Layer close to output\n\nFigure 2: The distribution of the sizes of minimal matches of layers close to the input and output respectively.\n\n6 Conclusion\n\nIn this paper, we investigate the similarity between representations learned by two networks with identical architecture but trained from different initializations. We develop a rigorous theory and propose efficient algorithms. 
Finally, we apply the algorithms in experiments and find that representations learned by convolutional layers are not as similar as prevalently expected.\nThis raises important questions: Does our result imply that the two networks learn completely different representations, or is subspace match not a good metric for measuring the similarity of representations? If the former is true, we need to rethink not only learning representations, but also the interpretability of deep learning. If from each initialization one learns a different representation, how can we interpret the network? If, on the other hand, subspace match is not a good metric, then what is the right metric for similarity of representations? We believe this is a fundamental problem for deep learning and worth systematic and in-depth study.\n\n7 Acknowledgements\n\nThis work is supported by the National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026), BJNSF (L172037) and a grant from Microsoft Research Asia.\n\nReferences\nYann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933\u20132941, 2014.\n\nJ. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, 2016.\n\nAlex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.\n\nAlex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. 
In Advances in Neural Information Processing Systems, pages 1097\u20131105, 2012.\n\nYixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural networks learn the same representations? In International Conference on Learning Representations (ICLR \u201916), 2016.\n\nMaithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6078\u20136087, 2017.\n\n9\n\n\fKaren Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n10\n\n\f", "award": [], "sourceid": 5841, "authors": [{"given_name": "Liwei", "family_name": "Wang", "institution": "Peking University"}, {"given_name": "Lunjia", "family_name": "Hu", "institution": "Stanford University"}, {"given_name": "Jiayuan", "family_name": "Gu", "institution": "University of California, San Diego"}, {"given_name": "Zhiqiang", "family_name": "Hu", "institution": "Peking University"}, {"given_name": "Yue", "family_name": "Wu", "institution": "Peking University"}, {"given_name": "Kun", "family_name": "He", "institution": "Hua Zhong University of Science and Technology"}, {"given_name": "John", "family_name": "Hopcroft", "institution": "Cornell University"}]}