{"title": "Universal Invariant and Equivariant Graph Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7092, "page_last": 7101, "abstract": "Graph Neural Networks (GNN) come in many flavors, but should always be either invariant (permutation of the nodes of the input graph does not affect the output) or \\emph{equivariant} (permutation of the input permutes the output). In this paper, we consider a specific class of invariant and equivariant networks, for which we prove new universality theorems. More precisely, we consider networks with a single hidden layer, obtained by summing channels formed by applying an equivariant linear operator, a pointwise non-linearity, and either an invariant or equivariant linear output layer. Recently, Maron et al. (2019) showed that by allowing higher-order tensorization inside the network, universal invariant GNNs can be obtained. As a first contribution, we propose an alternative proof of this result, which relies on the Stone-Weierstrass theorem for algebra of real-valued functions. Our main contribution is then an extension of this result to the \\emph{equivariant} case, which appears in many practical applications but has been less studied from a theoretical point of view. The proof relies on a new generalized Stone-Weierstrass theorem for algebra of equivariant functions, which is of independent interest. 
Additionally, unlike many previous works that consider a fixed number of nodes, our results show that a GNN defined by a single set of parameters can approximate uniformly well a function defined on graphs of varying size.", "full_text": "Universal Invariant and Equivariant Graph Neural Networks\n\nNicolas Keriven, \u00c9cole Normale Sup\u00e9rieure, Paris, France, nicolas.keriven@ens.fr\nGabriel Peyr\u00e9, CNRS and \u00c9cole Normale Sup\u00e9rieure, Paris, France, gabriel.peyre@ens.fr\n\nAbstract\n\nGraph Neural Networks (GNN) come in many flavors, but should always be either invariant (permutation of the nodes of the input graph does not affect the output) or equivariant (permutation of the input permutes the output). In this paper, we consider a specific class of invariant and equivariant networks, for which we prove new universality theorems. More precisely, we consider networks with a single hidden layer, obtained by summing channels formed by applying an equivariant linear operator, a pointwise non-linearity, and either an invariant or equivariant linear output layer. Recently, Maron et al. (2019b) showed that by allowing higher-order tensorization inside the network, universal invariant GNNs can be obtained. As a first contribution, we propose an alternative proof of this result, which relies on the Stone-Weierstrass theorem for algebras of real-valued functions. Our main contribution is then an extension of this result to the equivariant case, which appears in many practical applications but has been less studied from a theoretical point of view. The proof relies on a new generalized Stone-Weierstrass theorem for algebras of equivariant functions, which is of independent interest. 
Additionally,\nunlike many previous works that consider a \ufb01xed number of nodes, our results\nshow that a GNN de\ufb01ned by a single set of parameters can approximate uniformly\nwell a function de\ufb01ned on graphs of varying size.\n\n1\n\nIntroduction\n\nDesigning Neural Networks (NN) to exhibit some invariance or equivariance to group operations is\na central problem in machine learning (Shawe-Taylor, 1993). Among these, Graph Neural Networks\n(GNN) are primary examples that have gathered a lot of attention for a large range of applications.\nIndeed, since a graph is not changed by permutation of its nodes, GNNs must be either invariant\nto permutation, if they return a result that must not depend on the representation of the input, or\nequivariant to permutation, if the output must be permuted when the input is permuted, for instance\nwhen the network returns a signal over the nodes of the input graph. In this paper, we examine\nuniversal approximation theorems for invariant and equivariant GNNs.\nFrom a theoretical point of view, invariant GNNs have been much more studied than their equivariant\ncounterpart (see the following subsection). However, many practical applications deal with equivari-\nance instead, such as community detection (Chen et al., 2019), recommender systems (Ying et al.,\n2018), interaction networks of physical systems (Battaglia et al., 2016), state prediction (Sanchez-\nGonzalez et al., 2018), protein interface prediction (Fout et al., 2017), among many others. See (Zhou\net al., 2018; Bronstein et al., 2017) for thorough reviews. 
It is therefore of great interest to increase our understanding of equivariant networks, in particular by extending arguably one of the most classical results on neural networks, namely the universal approximation theorem for the multi-layer perceptron (MLP) with a single hidden layer (Cybenko, 1989; Hornik et al., 1989; Pinkus, 1999).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nMaron et al. (2019b) recently proved that certain invariant GNNs are universal approximators of invariant continuous functions on graphs. The main goal of this paper is to extend this result to the equivariant case, for similar architectures.\n\nOutline and contribution. The outline of our paper is as follows. After reviewing previous works and notations in the rest of the introduction, in Section 2 we provide an alternative proof of the result of (Maron et al., 2019b) for invariant GNNs (Theorem 1), which will serve as a basis for the equivariant case. It relies on a non-trivial application of the classical Stone-Weierstrass theorem for algebras of real-valued functions (recalled in Theorem 2). Then, as our main contribution, in Section 3 we prove this result for the equivariant case (Theorem 3), which to the best of our knowledge was not known before. The proof relies on a new version of the Stone-Weierstrass theorem (Theorem 4). Unlike many works that consider a fixed number of nodes n, in both cases we will prove that a GNN described by a single set of parameters can approximate uniformly well a function that acts on graphs of varying size.\n\n1.1 Previous works\n\nThe design of neural network architectures which are equivariant or invariant under group actions is an active area of research, see for instance (Ravanbakhsh et al., 2017; Gens and Domingos, 2014; Cohen and Welling, 2016) for finite groups and (Wood and Shawe-Taylor, 1996; Kondor and Trivedi, 2018) for infinite groups. 
We focus our attention here on discrete groups acting on the coordinates of the features, and more specifically on the action of the full set of permutations on tensors (order-1 tensors corresponding to sets, order-2 to graphs, order-3 to triangulations, etc.).\n\nConvolutional GNN. The most appealing construction of GNN architectures is through the use of local operators acting on vectors indexed by the vertices. Early definitions of these \u201cmessage passing\u201d architectures rely on fixed point iterations (Scarselli et al., 2009), while more recent constructions make use of non-linear functions of the adjacency matrix, for instance using spectral decompositions (Bruna et al., 2014) or polynomials (Defferrard et al., 2016). We refer to (Bronstein et al., 2017; Xu et al., 2019) for recent reviews. For regular-grid graphs, they match classical convolutional networks (LeCun et al., 1989), which by design can only approximate translation-invariant or equivariant functions (Yarotsky, 2018). It thus comes as no surprise that these convolutional GNNs are not universal approximators (Xu et al., 2019) of permutation-invariant functions.\n\nFully-invariant GNN. Designing Graph (and their higher-dimensional generalizations) NNs which are equivariant or invariant to the whole permutation group (as opposed to e.g. only translations) requires the use of a small sub-space of linear operators, which is identified in (Maron et al., 2019a). This generalizes several previous constructions, for instance for sets (Zaheer et al., 2017; Hartford et al., 2018) and point clouds (Qi et al., 2017). 
Universality results are known to hold in the special cases of sets, point clouds (Qi et al., 2017) and discrete measures (de Bie et al., 2019) networks. In the invariant GNN case, the universality of architectures built using a single hidden layer of such equivariant operators followed by an invariant layer is proved in (Maron et al., 2019b) (see also (Kondor et al., 2018)). This is the closest work to ours, and we will provide an alternative proof of this result in Section 2, as a basis for our main result in Section 3.\nUniversality in the equivariant case has been less studied. Most of the literature focuses on equivariance to translation and its relation to convolutions (Kondor et al., 2018; Cohen and Welling, 2016), which are ubiquitous in image processing. In this context, Yarotsky (2018) proved the universality of some translation-equivariant networks. Closer to our work, universality of NNs equivariant to permutations acting on point clouds has been recently proven in (Sannai et al., 2019); however, their theorem does not allow for high-order inputs like graphs. It is the purpose of our paper to fill this missing piece and prove the universality of a class of equivariant GNNs for high-order inputs such as (hyper-)graphs.\n\n1.2 Notations and definitions\n\nGraphs. In this paper, (hyper-)graphs with n nodes are represented by tensors G \u2208 Rnd indexed by 1 \u2264 i1, . . . , id \u2264 n. For instance, \u201cclassical\u201d graphs are represented by edge weight matrices (d = 2), and hyper-graphs by high-order tensors of \u201cmulti-edges\u201d connecting more than two nodes.\n\nNote that we do not impose G to be symmetric, or to contain only non-negative elements. In the rest of the paper, we fix some d \u2265 1 for the order of the inputs; however, we allow n to vary.\n\nPermutations. Let [n] def.= {1, . . . , n}. 
The set of permutations \u03c3 : [n] \u2192 [n] (bijections from [n] to itself) is denoted by On, or simply O when there is no ambiguity. Given a permutation \u03c3 and an order-k tensor G \u2208 Rnk, a \u201cpermutation of nodes\u201d on G is denoted by \u03c3 \u22c6 G and defined as\n\n(\u03c3 \u22c6 G)\u03c3(i1),...,\u03c3(ik) = Gi1,...,ik .\n\nWe denote by P\u03c3 \u2208 {0, 1}n\u00d7n the permutation matrix corresponding to \u03c3, or simply P when there is no ambiguity. For instance, for G \u2208 Rn2 we have \u03c3 \u22c6 G = P GP\u22a4.\nTwo graphs G1, G2 are said to be isomorphic if there is a permutation \u03c3 such that G1 = \u03c3 \u22c6 G2. If G = \u03c3 \u22c6 G, we say that \u03c3 is a self-isomorphism of G. Finally, we denote by O(G) def.= {\u03c3 \u22c6 G ; \u03c3 \u2208 O} the orbit of all the permuted versions of G.\n\nInvariant and equivariant linear operators. A function f : Rnk \u2192 R is said to be invariant if f(\u03c3 \u22c6 G) = f(G) for every permutation \u03c3. A function f : Rnk \u2192 Rn\u2113 is said to be equivariant if f(\u03c3 \u22c6 G) = \u03c3 \u22c6 f(G). Our construction of GNNs alternates between linear operators that are invariant or equivariant to permutations, and non-linearities. Maron et al. (2019a) elegantly characterize all such linear functions, and prove that they live in vector spaces of dimension, respectively, exactly b(k) and b(k + \u2113), where b(i) is the ith Bell number. An important corollary of this result is that the dimension of this space does not depend on the number of nodes n, but only on the orders of the input and output tensors. Therefore one can parameterize linearly, for all n, such an operator by the same set of coefficients. 
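As a concrete illustration (a hypothetical NumPy sketch, not from the paper's codebase), the action \u03c3 \u22c6 G for d = 2, its matrix form P G P\u22a4, and the invariance/equivariance definitions above can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
G = rng.normal(size=(n, n))      # an order-2 tensor (a graph), d = 2

sigma = rng.permutation(n)       # a permutation of [n]
P = np.eye(n)[:, sigma]          # permutation matrix: P[i, j] = 1 iff i = sigma(j)

# For d = 2 the node-permutation action is sigma * G = P G P^T ...
G_perm = P @ G @ P.T

# ... equivalently, in index form: (sigma * G)[sigma(i), sigma(j)] = G[i, j].
G_idx = np.empty_like(G)
G_idx[np.ix_(sigma, sigma)] = G
assert np.allclose(G_perm, G_idx)

# A simple invariant function: the sum of all entries.
assert np.isclose(G.sum(), G_perm.sum())

# A simple equivariant function f(G) = row sums: f(sigma * G) = sigma * f(G),
# where for vectors (sigma * x)[sigma(i)] = x[i], i.e. sigma * x = P x.
f = lambda A: A.sum(axis=1)
assert np.allclose(f(G_perm), P @ f(G))
```

The same index-form definition extends to any tensor order k by permuting all k axes at once.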
For instance, a linear equivariant operator F : Rn2 \u2192 Rn2 from matrices to matrices is formed by a linear combination of b(4) = 15 basic operators such as \u201csum of rows replicated on the diagonal\u201d, \u201csum of columns replicated on the rows\u201d, and so on. The 15 coefficients used in this linear combination define the \u201csame\u201d linear operator for every n.\n\nInvariant and equivariant Graph Neural Nets. As noted by Yarotsky (2018), it is in fact trivial to build invariant universal networks for finite groups of symmetry: just take a non-invariant universal architecture, and perform a group averaging. However, this holds little interest in practice, since the group of permutations is of size n!. Instead, researchers use architectures for which invariance is hard-coded into the construction of the network itself. The same remark holds for equivariance.\nIn this paper, we consider one-layer GNNs of the form:\n\nf(G) = \u2211_{s=1}^{S} Hs[\u03c1(Fs[G] + Bs)] + b,   (1)\n\nwhere Fs : Rnd \u2192 Rnks are linear equivariant functions that yield ks-tensors (i.e. they potentially increase or decrease the order of the input tensor), and Hs are invariant linear operators Hs : Rnks \u2192 R (resp. equivariant linear operators Hs : Rnks \u2192 Rn), such that the GNN is globally invariant (resp. equivariant). The invariant case is studied in Section 2, and the equivariant in Section 3. The bias terms Bs \u2208 Rnks are equivariant, so that Bs = \u03c3 \u22c6 Bs for all \u03c3. They are also characterized by Maron et al. (2019a) and belong to a linear space of dimension b(ks). We illustrate this simple architecture in Fig. 1.\nIn light of the characterization by Maron et al. (2019a) of linear invariant and equivariant operators described in the previous paragraph, a GNN of the form (1) is described by 1 + \u2211_{s=1}^{S} b(ks + d) + 2b(ks) parameters in the invariant case and 1 + \u2211_{s=1}^{S} b(ks + d) + b(ks + 1) + b(ks) in the equivariant case. As mentioned earlier, this number of parameters does not depend on the number of nodes n, and a GNN described by a single set of parameters can be applied to graphs of any size. In particular, we are going to show that a GNN approximates uniformly well a continuous function for several n at once.\nThe function \u03c1 is any locally Lipschitz pointwise non-linearity for which the Universal Approximation Theorem for MLPs applies. We denote their set FMLP. This includes in particular any continuous function that is not a polynomial (Pinkus, 1999). Among these, we denote the sigmoid \u03c1sig(x) = ex/(1 + ex).\n\nFigure 1: The model of GNNs studied in this paper. For each channel s \u2264 S, the input tensor is passed through an equivariant operator Fs : Rnd \u2192 Rnks, a non-linearity with some added equivariant bias Bs, and a final operator Hs that is either invariant (Section 2) or equivariant (Section 3). These GNNs are universal approximators of invariant or equivariant continuous functions (Theorems 1 and 3).\n\nWe denote by Ninv.(\u03c1) (resp. Neq.(\u03c1)) the class of invariant (resp. equivariant) 1-layer networks of the form (1) (with S and ks being arbitrarily large). Our contributions show that they are dense in the spaces of continuous invariant (resp. equivariant) functions.\n\n2 The case of invariant functions\n\nMaron et al. (2019b) recently proved that invariant GNNs similar to (1) are universal approximators of continuous invariant functions. 
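To fix ideas, a single invariant channel of the architecture (1) can be sketched in a few lines of NumPy. This is our own minimal, hypothetical illustration (not the paper's released implementation): we take d = 2, k1 = 1, F1 the equivariant row-sum operator, B1 a constant bias (an equivariant bias for k = 1), and H1 the invariant coordinate sum; the same scalar parameters apply to any graph size n:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gnn_invariant(G, theta=(1.0, 0.5, -0.2)):
    """One channel of (1): H[ rho(F[G] + B) ] + b, with
    F = row sums (equivariant, R^{n^2} -> R^n), B = constant bias,
    H = coordinate sum (invariant, R^n -> R)."""
    w, beta, b = theta                # size-independent parameters
    x = sigmoid(w * G.sum(axis=1) + beta)   # rho(F[G] + B), in R^n
    return x.sum() + b               # invariant output layer

rng = np.random.default_rng(1)
for n in (5, 10):                    # one parameter set, several graph sizes
    G = rng.normal(size=(n, n))
    sigma = rng.permutation(n)
    G_perm = G[np.ix_(sigma, sigma)]     # a permuted version of G
    assert np.isclose(gnn_invariant(G), gnn_invariant(G_perm))
```

The check confirms permutation invariance and that the same three parameters define a network on graphs with 5 and 10 nodes alike.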
As a warm-up, we propose an alternative proof of (a variant of) this result, which will serve as a basis for our main contribution, the equivariant case (Section 3).\n\nEdit distance. For invariant functions, isomorphic graphs are indistinguishable, and therefore we work with a set of equivalence classes of graphs, where two graphs are equivalent if isomorphic. We define such a set for any number n \u2264 nmax of nodes and bounded G:\n\nGinv. def.= {O(G) ; G \u2208 Rnd with n \u2264 nmax, \u2016G\u2016 \u2264 R},\n\nwhere we recall that O(G) = {\u03c3 \u22c6 G ; \u03c3 \u2208 O} is the set of every permuted version of G, here seen as an equivalence class.\nWe need to equip this set with a metric that takes into account graphs with different numbers of nodes. A distance often used in the literature is the graph edit distance (Sanfeliu and Fu, 1983). It relies on defining a set of elementary operations o and a cost c(o) associated to each of them; here we consider node addition and edge weight modification. The distance is then defined as\n\ndedit(O(G1), O(G2)) def.= min_{(o1,...,ok) \u2208 P(G1,G2)} \u2211_{i=1}^{k} c(oi)   (2)\n\nwhere P(G1, G2) contains every sequence of operations to transform G1 into a graph isomorphic to G2, or G2 into G1. Here we consider c(node_addition) = c for some constant c > 0, c(edge_weight_change) = |w \u2212 w\u2032| where the weight change is from w to w\u2032, and \u201cedge\u201d refers to any element of the tensor G \u2208 Rnd. Note that, if we have dedit(O(G1), O(G2)) < c, then G1 and G2 have the same number of nodes, and in that case dedit(O(G1), O(G2)) = min_{\u03c3 \u2208 On} \u2016G1 \u2212 \u03c3 \u22c6 G2\u20161, where \u2016\u00b7\u20161 is the element-wise \u21131 norm, since each edge must be transformed into another.\nWe denote by C(Ginv., dedit) the space of real-valued functions on Ginv. 
that are continuous with respect to dedit, equipped with the infinity norm of uniform convergence. We then have the following result.\nTheorem 1. For any \u03c1 \u2208 FMLP, Ninv.(\u03c1) is dense in C(Ginv., dedit).\n\nComparison with (Maron et al., 2019b). A variant of Theorem 1 was proved in (Maron et al., 2019b). The two proofs are however different: their proof relies on the construction of a basis of invariant polynomials and on classical universality of MLPs, while our proof is a direct application of the Stone-Weierstrass theorem for algebras of real-valued functions. See the next subsection for details.\nOne improvement of our result with respect to the one of (Maron et al., 2019b) is that it can handle graphs of varying sizes. As mentioned in the introduction, a single set of parameters defines a GNN that can be applied to graphs of any size. Theorem 1 shows that any continuous invariant function is uniformly well approximated by a GNN on the whole set Ginv., that is, for all numbers of nodes n \u2264 nmax simultaneously. On the contrary, Maron et al. (2019b) work with a fixed n, and it does not seem that their proof can extend easily to encompass several n at once. A weakness of our proof is that it does not provide an upper bound on the order of tensorization ks. Indeed, through Noether\u2019s theorem on polynomials, the proof of Maron et al. (2019b) shows that ks \u2264 nd(nd \u2212 1)/2 is sufficient for universality, which we cannot seem to deduce from our proof. Moreover, they provide a lower bound ks \u2265 nd below which universality cannot be achieved.\n\n2.1 Sketch of proof of Theorem 1\n\nThe proof for the invariant case will serve as a basis for the equivariant case in Section 3. It relies on the Stone-Weierstrass theorem, which we recall below.\nTheorem 2 (Stone-Weierstrass (Rudin (1991), Thm. 5.7)). 
Suppose X is a compact Hausdorff space and A is a subalgebra of the space of continuous real-valued functions C(X) which contains a non-zero constant function. Then A is dense in C(X) if and only if it separates points, that is, for all x \u2260 y in X there exists f \u2208 A such that f(x) \u2260 f(y).\nWe will construct a class of GNNs that satisfies all these properties in Ginv.. As we will see, unlike classical applications of this theorem to e.g. polynomials, the main difficulty here will be to prove the separation of points. We start by observing that Ginv. is indeed a compact set for dedit.\n\nProperties of (Ginv., dedit). Let us first note that the metric space (Ginv., dedit) is Hausdorff (i.e. separable, as all metric spaces are). For each O(G1), O(G2) \u2208 Ginv. we have: if dedit(O(G1), O(G2)) < c, then the graphs have the same number of nodes, and in that case dedit(O(G1), O(G2)) \u2264 \u2016G1 \u2212 G2\u20161. Therefore, the embedding G \u21a6 O(G) is continuous (locally Lipschitz). As the continuous image of the compact \u222a_{n=1}^{nmax} {G \u2208 Rnd ; \u2016G\u2016 \u2264 R}, the set Ginv. is indeed compact.\n\nAlgebra of invariant GNNs. Unfortunately, Ninv.(\u03c1) is not a subalgebra. Following Hornik et al. (1989), we first need to extend it to be closed under multiplication. We do that by allowing Kronecker products inside the invariant functions:\n\nf(G) = \u2211_{s=1}^{S} Hs[\u03c1(Fs1[G] + Bs1) \u2297 . . . \u2297 \u03c1(FsTs[G] + BsTs)] + b   (3)\n\nwhere Fst yields kst-tensors, Hs : Rn^{\u2211_t kst} \u2192 R are invariant, and the Bst are equivariant biases. Since (\u03c3 \u22c6 G) \u2297 (\u03c3 \u22c6 G\u2032) = \u03c3 \u22c6 (G \u2297 G\u2032), these networks are indeed invariant. We denote by N\u2297inv.(\u03c1) the set of all GNNs of this form, with S, Ts, kst arbitrarily large.\nLemma 1. For any locally Lipschitz \u03c1, N\u2297inv.(\u03c1) is a subalgebra of C(Ginv., dedit).\nThe proof, presented in Appendix A.1.1, follows from manipulations of Kronecker products.\n\nSeparability. The main difficulty in applying the Stone-Weierstrass theorem is the separation of points, which we prove in the next Lemma.\nLemma 2. N\u2297inv.(\u03c1sig) separates points.\nThe proof, presented in Appendix A.1.2, proceeds by contradiction: we show that two graphs G, G\u2032 that coincide for every GNN are necessarily permutations of each other. Applying the Stone-Weierstrass theorem, we have thus proved that N\u2297inv.(\u03c1sig) is dense in C(Ginv., dedit).\nThen, following Hornik et al. (1989), we go back to the original class Ninv.(\u03c1), by applying: (i) a Fourier approximation of \u03c1sig, (ii) the fact that a product of cos is also a sum of cos, and (iii) an approximation of cos by any other non-linearity. The following Lemma is proved in Appendix A.1.3, and concludes the proof of Thm 1.\nLemma 3. We have the following: (i) N\u2297inv.(cos) is dense in N\u2297inv.(\u03c1sig); (ii) N\u2297inv.(cos) = Ninv.(cos); (iii) for any \u03c1 \u2208 FMLP, Ninv.(\u03c1) is dense in Ninv.(cos).\n\n3 The case of equivariant functions\n\nThis section contains our main contribution. We examine the case of equivariant functions that return a vector f(G) \u2208 Rn when G has n nodes, such that f(\u03c3 \u22c6 G) = \u03c3 \u22c6 f(G). In that case, isomorphic graphs are not equivalent anymore. Hence we consider a compact set of graphs\n\nGeq. def.= {G \u2208 Rnd ; n \u2264 nmax, \u2016G\u2016 \u2264 R}.\n\nLike the invariant case, we consider several numbers of nodes n \u2264 nmax and will prove uniform approximation over them. 
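For concreteness, a single equivariant channel of (1) can be sketched just as in the invariant case. This is a hypothetical illustration of ours (not the paper's code): F is the row-sum operator Rn2 \u2192 Rn, and the equivariant output layer H : Rn \u2192 Rn is taken as a combination of the identity and the averaging operator, two linear equivariant maps on vectors; the check verifies f(\u03c3 \u22c6 G) = \u03c3 \u22c6 f(G):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gnn_equivariant(G, theta=(1.0, 0.3, 0.7, -0.5, 0.1)):
    """One channel of (1) with equivariant output: H[ rho(F[G] + B) ] + b.
    F = row sums (R^{n^2} -> R^n); H combines the identity and the
    averaging operator, both linear and equivariant on R^n."""
    w, beta, h1, h2, b = theta                # size-independent parameters
    x = sigmoid(w * G.sum(axis=1) + beta)     # rho(F[G] + B), in R^n
    return h1 * x + h2 * x.mean() * np.ones_like(x) + b

rng = np.random.default_rng(2)
n = 6
G = rng.normal(size=(n, n))
sigma = rng.permutation(n)
P = np.eye(n)[:, sigma]                       # sigma * G = P G P^T
G_perm = P @ G @ P.T

# Equivariance: f(sigma * G) = sigma * f(G), i.e. P applied to the output.
assert np.allclose(gnn_equivariant(G_perm), P @ gnn_equivariant(G))
```

As in the invariant sketch, the five parameters are independent of n, so the same network acts on graphs of any size.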
We do not use the edit distance but a simpler metric:\n\nd(G, G\u2032) = \u2016G \u2212 G\u2032\u2016 if G and G\u2032 have the same number of nodes, and +\u221e otherwise,\n\nfor any norm \u2016\u00b7\u2016 on Rnd.\nThe set of equivariant continuous functions is denoted by Ceq.(Geq., d), equipped with the infinity norm \u2016f\u2016\u221e = sup_{G \u2208 Geq.} \u2016f(G)\u2016\u221e. We recall that Neq.(\u03c1) \u2282 Ceq.(Geq., d) denotes one-layer GNNs of the form (1), with equivariant output operators Hs. Our main result is the following.\nTheorem 3. For any \u03c1 \u2208 FMLP, Neq.(\u03c1) is dense in Ceq.(Geq., d).\nThe proof, detailed in the next section, follows closely the previous proof for invariant functions, but is significantly more involved. Indeed, the classical version of Stone-Weierstrass only provides density of a subalgebra of functions in the whole space of continuous functions, while in this case Ceq.(Geq., d) is already a particular subset of continuous functions. On the other hand, it seems difficult to make use of fully general versions of the Stone-Weierstrass theorem, for which some questions are still open (Glimm, 1960). Hence we prove a new, specialized Stone-Weierstrass theorem for equivariant functions (Theorem 4), obtained with a non-trivial adaptation of the constructive proof by Brosowski and Deutsch (1981).\nLike the invariant case, our theorem proves uniform approximation for all numbers of nodes n \u2264 nmax at once by a single GNN. As is detailed in the next subsection, our proof of the generalized Stone-Weierstrass theorem relies on being able to sort the coordinates of the output space Rn, and therefore our current proof technique does not extend to high-order outputs Rn\u2113 (graph-to-graph mappings), which we leave for future work. For the same reason, while the previous invariant case could be easily extended to invariance to subgroups of On, as is done by Maron et al. 
(2019b), for the equivariant case our theorem only applies when considering the full permutation group On. Nevertheless, our generalized Stone-Weierstrass theorem may be applicable in other contexts where equivariance to permutations is a desirable property.\n\nComparison with (Sannai et al., 2019). Sannai et al. (2019) recently proved that equivariant NNs acting on point clouds are universal, that is, for d = 1 in our notations. Despite the apparent similarity with our result, there is a fundamental obstruction to extending their proof to high-order input tensors like graphs. Indeed, it strongly relies on Theorem 2 of (Zaheer et al., 2017), which characterizes invariant functions Rn \u2192 R and is no longer valid for high-order inputs.\n\n3.1 Sketch of proof of Theorem 3: an equivariant version of the Stone-Weierstrass theorem\n\nWe first need to introduce a few more notations. For a subset I \u2282 [n], we define OI def.= {\u03c3 \u2208 On ; \u2203i \u2208 I, j \u2208 Ic, \u03c3(i) = j or \u03c3(j) = i}, the set of permutations that exchange at least one index between I and Ic. Indexing of vectors (or multivariate functions) is denoted by brackets, e.g. [x]I or [f]I, and inequalities x \u2265 a are to be understood element-wise.\n\nA new Stone-Weierstrass theorem. We define the \u201cmultiplication\u201d of two multivariate functions using the Hadamard product \u2299, i.e. the component-wise multiplication. Since (\u03c3 \u22c6 x) \u2299 (\u03c3 \u22c6 x\u2032) = \u03c3 \u22c6 (x \u2299 x\u2032), it is easy to see that Ceq.(Geq., d) is closed under multiplication, and is therefore a (strict) subalgebra of the set of all continuous functions that return a vector in Rn for an input graph with n nodes. As mentioned before, because of this last fact we cannot directly apply the Stone-Weierstrass theorem. 
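As an aside, the closure-under-multiplication identity (\u03c3 \u22c6 x) \u2299 (\u03c3 \u22c6 x\u2032) = \u03c3 \u22c6 (x \u2299 x\u2032) can be checked numerically on two simple equivariant functions (a small illustrative sketch of ours, with hypothetical helper names):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
G = rng.normal(size=(n, n))
sigma = rng.permutation(n)
P = np.eye(n)[:, sigma]              # P[i, j] = 1 iff i = sigma(j)
G_perm = P @ G @ P.T                 # sigma * G

# Two simple equivariant functions R^{n^2} -> R^n:
f = lambda A: A.sum(axis=1)          # row sums
g = lambda A: np.diag(A)             # diagonal

perm = lambda x: P @ x               # sigma * x for vectors in R^n

# (sigma * x) ⊙ (sigma * x') = sigma * (x ⊙ x'):
assert np.allclose(perm(f(G)) * perm(g(G)), perm(f(G) * g(G)))

# Hence the Hadamard product f ⊙ g is itself equivariant:
assert np.allclose(f(G_perm) * g(G_perm), perm(f(G) * g(G)))
```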
We therefore prove a new generalized version.\n\nFigure 2: Illustration of the strategy of proof for the equivariant Stone-Weierstrass theorem (Theorem 4). Considering a function f that we are trying to approximate and a graph G for which the coordinates of f(G) are sorted in decreasing order, we approximate f(G) by summing step-functions fi, whose first coordinates are close to 1, and which are otherwise close to 0.\n\nTheorem 4 (Stone-Weierstrass for equivariant functions). Let A be a subalgebra of Ceq.(Geq., d), such that A contains the constant function 1 and:\n\u2013 (Separability) for all G, G\u2032 \u2208 Geq. with numbers of nodes respectively n and n\u2032 such that G \u2209 O(G\u2032), and for any k \u2208 [n], k\u2032 \u2208 [n\u2032], there exists f \u2208 A such that [f(G)]k \u2260 [f(G\u2032)]k\u2032 ;\n\u2013 (\u201cSelf\u201d-separability) for all numbers of nodes n \u2264 nmax, I \u2282 [n], G \u2208 Geq. with n nodes that has no self-isomorphism in OI, and k \u2208 I, \u2113 \u2208 Ic, there is f \u2208 A such that [f(G)]k \u2260 [f(G)]\u2113.\nThen A is dense in Ceq.(Geq., d).\nIn addition to a \u201cseparability\u201d hypothesis, which is similar to the classical one, Theorem 4 requires a \u201cself\u201d-separability condition, which guarantees that f(G) can have different values on its coordinates under appropriate assumptions on G. We give below an overview of the proof of Theorem 4; the full details can be found in Appendix B.\nOur proof is inspired by the one for the classical Stone-Weierstrass theorem (Thm. 2) of Brosowski and Deutsch (1981). Let us first give a bit of intuition on this earlier proof. It relies on the explicit construction of \u201cstep\u201d-functions: given two disjoint closed sets A and B, they show that A contains functions that are approximately 0 on A and approximately 1 on B. Then, given a function f : X \u2192 R (non-negative w.l.o.g.) that we are trying to approximate and \u03b5 > 0, they define Ak = {x ; f(x) \u2264 (k \u2212 1/3)\u03b5} and Bk = {x ; f(x) \u2265 (k + 1/3)\u03b5} as the lower (resp. upper) level sets of f for a grid of values with precision \u03b5. Then, taking the step-functions fk between Ak and Bk, it is easy to prove that f is well-approximated by g = \u03b5 \u2211_k fk, since for each x only the right number of fk are close to 1, the others being close to 0.\nThe situation is more complicated in our case. Given a function f \u2208 Ceq.(Geq., d) that we want to approximate, we work in the compact subset of Geq. where the coordinates of f are ordered, Gf def.= {G \u2208 Geq. ; if G \u2208 Rnd: [f(G)]1 \u2265 [f(G)]2 \u2265 . . . \u2265 [f(G)]n}, since by permutation it covers every case. Then, we will prove the existence of step-functions such that: when A and B satisfy some appropriate hypotheses, the step-function is close to 0 on A, and only the first coordinates are close to 1 on B, the others being close to 0. Indeed, by combining such functions, we can approximate a vector of ordered coordinates (Fig. 2). The construction of such step-functions is done in Lemma 7. Finally, we consider modified level-sets that distinguish \u201cjumps\u201d between (ordered) coordinates:\n\nA_k^{n,\u2113} def.= {G \u2208 Gf \u2229 Rnd ; [f(G)]\u2113 \u2212 [f(G)]\u2113+1 \u2264 (k \u2212 1/3)\u03b5} \u222a \u22c3_{n\u2032 \u2260 n} (Gf \u2229 R(n\u2032)d),\nB_k^{n,\u2113} def.= {G \u2208 Gf \u2229 Rnd ; [f(G)]\u2113 \u2212 [f(G)]\u2113+1 \u2265 (k + 1/3)\u03b5},\n\nand we define the associated step-functions f_k^{n,\u2113} and show that g = \u03b5 \u2211_{k,n,\u2113} f_k^{n,\u2113} is a valid approximation of f.\n\nEnd of the proof. 
The rest of the proof of Theorem 3 is similar to the invariant case. We \ufb01rst\nbuild an algebra of GNNs, again by considering nets of the form (3), where we replace the Hs\u2019s by\nequivariant linear operators in this case. We denote this space by N \u2297\nLemma 4. N \u2297\n\neq.(\u03c1) is a subalgebra of Ceq.(Geq., d).\n\neq.(\u03c1).\n\n7\n\n\feq.(\u03c1sig.) satis\ufb01es both the separability and self-separability conditions.\n\nThe proof, presented in Appendix A.2.1, is very similar to that of Lemma 1. Then we show the two\nseparation conditions for equivariant GNNs.\nLemma 5. N \u2297\nThe proof is presented in Appendix A.2.2. The \u201cnormal\u201d separability is in fact equivalent to the\nprevious one (Lemma 2), since we can construct an equivariant network by simply stacking an\ninvariant network on every coordinate. The self-separability condition is proved in a similar way.\nFinally we go back to Neq.(\u03c1) in exactly the same way. The proof of Lemma 6 is exactly similar to\nthat of Lemma 3 and is omitted.\nLemma 6. We have the following: (i) N \u2297\neq.(cos) = Neq.(cos);\n(iii) for any \u03c1 \u2208 FMLP, Neq.(\u03c1) is dense in Neq.(cos).\n\neq.(cos) is dense in N \u2297\n\neq.(\u03c1sig); (ii) N \u2297\n\n4 Numerical illustrations\n\nThis section provides numerical illustrations of our \ufb01ndings on\nsimple synthetic examples. The goal is to examine the impact\nof the tensorization orders ks and the width S. The code is\navailable at https://github.com/nkeriven/univgnn. We\nemphasize that the contribution of the present paper is \ufb01rst and\nforemost theoretical, and that, like MLPs with a single hidden\nlayer, we cannot expect the shallow GNNs (1) to be state-of-the-\nart and compete with deep models, despite their universality. A\nbenchmarking of deep GNNs that use invariant and equivariant\nlinear operators is done in (Maron et al., 2019a).\nWe consider graphs, represented using their adjacency matrices\n(i.e. 
a 2-way tensor, so that d = 2). The synthetic graphs are drawn uniformly among 5 graph topologies (complete graph, star, cycle, path or wheel), with edge weights drawn independently as the absolute value of a centered Gaussian variable. Since our approximation results are valid for several graph sizes simultaneously, both the training and testing datasets contain 1.4 · 10^4 graphs, half with 5 nodes and half with 10 nodes. The training is performed by minimizing a square Euclidean loss (MSE) on the training dataset, by stochastic gradient descent using the ADAM optimizer (Kingma and Ba, 2014). We consider two different regression tasks: (i) in the invariant case, the scalar to predict is the geodesic diameter of the graph; (ii) in the equivariant case, the vector to predict assigns to each node the length of the longest shortest path emanating from it. While these functions can be computed using polynomial-time all-pairs shortest-path algorithms, they are highly non-local, and are thus challenging to learn using neural network architectures. The GNNs (1) are implemented with a fixed tensorization order k_s = k ∈ {1, 2, 3} and ρ = ρ_sig..

Figure 3 shows that, on these two cases, when increasing the width S, the out-of-sample prediction error quickly stagnates (and sometimes increasing S too much can slightly degrade performance by making the training harder). In sharp contrast, increasing the tensorization order k has a significant impact and lowers this optimal error value. This supports the fact that universality relies on the use of higher tensorization orders. Integrating higher-order tensors within deeper architectures to better capture complex functions on graphs is a promising direction of research.

Figure 3: MSE results after 150 epochs, in the invariant (top) and equivariant (bottom) cases, averaged over 5 experiments.
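Both regression targets above are all-pairs shortest-path quantities. The sketch below is our own illustration of how they can be computed (function name and test graph are ours, not taken from the paper's repository), using the Floyd-Warshall algorithm on a weighted adjacency matrix:

```python
import numpy as np

def shortest_path_targets(adj):
    """From a nonnegative weighted adjacency matrix (0 = no edge), return
    the invariant target (geodesic diameter) and the equivariant target
    (per-node length of the longest shortest path emanating from it),
    via the Floyd-Warshall all-pairs shortest-path algorithm."""
    n = adj.shape[0]
    dist = np.where(adj > 0, adj, np.inf)  # missing edges = infinite cost
    np.fill_diagonal(dist, 0.0)
    for k in range(n):
        # relax every pair (i, j) through intermediate node k
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    ecc = dist.max(axis=1)   # equivariant: one value per node
    return ecc.max(), ecc    # invariant: the diameter

# tiny check on the unweighted path graph 0 - 1 - 2 (diameter 2)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
diam, ecc = shortest_path_targets(A)
print(diam, ecc)  # 2.0 [2. 1. 2.]
```

Permuting the rows and columns of A permutes ecc accordingly while leaving diam unchanged, which is exactly the equivariant/invariant distinction the two tasks are designed to probe.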
Dashed lines represent the testing error.

5 Conclusion

In this paper, we proved the universality of a class of one-hidden-layer equivariant networks. Handling this vector-valued setting required extending the classical Stone-Weierstrass theorem. It remains an open problem to extend this proof technique to more general equivariant networks whose outputs are graph-valued, which are useful for instance to model dynamic graphs using recurrent architectures (Battaglia et al., 2016). Another outstanding open question, formulated in (Maron et al., 2019b), is the characterization of the approximation power of networks whose tensorization orders k_s inside the layers are bounded, since they are much more likely to be implemented on large graphs in practice.

References

P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu. Interaction Networks for Learning about Objects, Relations and Physics. In Advances in Neural Information Processing Systems (NIPS), pages 4509–4517, 2016.

M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

B. Brosowski and F. Deutsch. An elementary proof of the Stone-Weierstrass Theorem. Proceedings of the American Mathematical Society, 81(1):89–92, 1981.

J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral Networks and Locally Connected Networks on Graphs. In ICLR, pages 1–14, 2014.

Z. Chen, X. Li, and J. Bruna. Supervised Community Detection with Line Graph Neural Networks. In ICLR, 2019.

T. Cohen and M. Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999, 2016.

G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.

G.
de Bie, G. Peyré, and M. Cuturi. Stochastic deep networks. In Proceedings of ICML, 2019.

M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems (NIPS), 2016.

A. Fout, B. Shariat, J. Byrd, and A. Ben-Hur. Protein Interface Prediction using Graph Convolutional Networks. In Advances in Neural Information Processing Systems (NIPS), pages 6512–6521, 2017.

R. Gens and P. M. Domingos. Deep symmetry networks. In Advances in Neural Information Processing Systems, pages 2537–2545, 2014.

J. Glimm. A Stone-Weierstrass Theorem for C*-Algebras. Annals of Mathematics, 72(2):216–244, 1960.

J. Hartford, D. R. Graham, K. Leyton-Brown, and S. Ravanbakhsh. Deep models of interactions across sets. arXiv preprint arXiv:1803.02879, 2018.

K. Hornik, M. Stinchcombe, and H. White. Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2:359–366, 1989.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.

R. Kondor and S. Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

R. Kondor, H. T. Son, H. Pan, B. Anderson, and S. Trivedi. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman. Invariant and Equivariant Graph Networks. In ICLR, pages 1–13, 2019a.

H. Maron, E. Fetaya, N. Segol, and Y. Lipman. On the Universality of Invariant Networks. In International Conference on Machine Learning (ICML), 2019b.

A. Pinkus. Approximation theory of the MLP model in neural networks.
Acta Numerica, 8:143–195, 1999.

C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

S. Ravanbakhsh, J. Schneider, and B. Poczos. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning, pages 2892–2901, 2017.

W. Rudin. Functional Analysis. 1991.

A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia. Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242, 2018.

A. Sanfeliu and K.-S. Fu. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, (3):353–362, 1983.

A. Sannai, Y. Takai, and M. Cordonnier. Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939, 2019.

F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

J. Shawe-Taylor. Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks, 4(5):816–826, 1993.

J. Wood and J. Shawe-Taylor. Representation theory and invariant neural networks. Discrete Applied Mathematics, 69(1-2):33–60, 1996.

K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How Powerful are Graph Neural Networks? In ICLR, pages 1–15, 2019.

D. Yarotsky. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.

R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph Convolutional Neural Networks for Web-Scale Recommender Systems.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983, 2018.

M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.

J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun. Graph Neural Networks: A Review of Methods and Applications. arXiv preprint arXiv:1812.08434, 2018.