{"title": "GOT: An Optimal Transport framework for Graph comparison", "book": "Advances in Neural Information Processing Systems", "page_first": 13876, "page_last": 13887, "abstract": "We present a novel framework based on optimal transport for the challenging problem of comparing graphs. Specifically, we exploit the probabilistic distribution of smooth graph signals defined with respect to the graph topology. This allows us to derive an explicit expression of the Wasserstein distance between graph signal distributions in terms of the graph Laplacian matrices. This leads to a structurally meaningful measure for comparing graphs, which is able to take into account the global structure of graphs, while most other measures merely observe local changes independently. Our measure is then used for formulating a new graph alignment problem, whose objective is to estimate the permutation that minimizes the distance between two graphs. We further propose an efficient stochastic algorithm based on Bayesian exploration to accommodate the non-convexity of the graph alignment problem. 
We finally demonstrate the performance of our novel framework on different tasks like graph alignment, graph classification and graph signal prediction, and we show that our method leads to significant improvement with respect to the state-of-the-art algorithms.", "full_text": "GOT: An Optimal Transport framework for Graph comparison\n\nHermina Petric Maretic\nEcole Polytechnique Fédérale de Lausanne\nSignal Processing Laboratory (LTS4)\nLausanne, Switzerland\nhermina.petricmaretic@epfl.ch\n\nMireille EL Gheche\nEcole Polytechnique Fédérale de Lausanne\nSignal Processing Laboratory (LTS4)\nLausanne, Switzerland\nmireille.elgheche@epfl.ch\n\nGiovanni Chierchia\nUniversité Paris-Est, LIGM (UMR 8049)\nCNRS, ENPC, ESIEE Paris, UPEM\nF-93162, Noisy-le-Grand, France\ngiovanni.chierchia@esiee.fr\n\nPascal Frossard\nEcole Polytechnique Fédérale de Lausanne\nSignal Processing Laboratory (LTS4)\nLausanne, Switzerland\npascal.frossard@epfl.ch\n\nAbstract\n\nWe present a novel framework based on optimal transport for the challenging problem of comparing graphs. Specifically, we exploit the probabilistic distribution of smooth graph signals defined with respect to the graph topology. This allows us to derive an explicit expression of the Wasserstein distance between graph signal distributions in terms of the graph Laplacian matrices. This leads to a structurally meaningful measure for comparing graphs, which is able to take into account the global structure of graphs, while most other measures merely observe local changes independently. Our measure is then used for formulating a new graph alignment problem, whose objective is to estimate the permutation that minimizes the distance between two graphs. We further propose an efficient stochastic algorithm based on Bayesian exploration to accommodate the non-convexity of the graph alignment problem. 
We finally demonstrate the performance of our novel framework on different tasks like graph alignment, graph classification and graph signal prediction, and we show that our method leads to significant improvement with respect to the state-of-the-art algorithms.\n\n1 Introduction\n\nWith the rapid development of digitisation in various domains, the volume of data increases very rapidly, with much of it taking the form of structured data. Such information is often represented by graphs that capture potentially complex structures. It remains, however, challenging to analyse, classify or predict graph data, due to the lack of efficient measures for comparing graphs. In particular, the mere comparison of graph matrices is not necessarily a meaningful distance, as different edges can have a diverse importance in the graph. Spectral distances have also been proposed [1, 2], but they usually do not take into account all the information provided by the graphs, focusing only on the Laplacian matrix eigenvalues and ignoring a large portion of the structure encoded in the eigenvectors. In addition to the lack of effective distances, a major difficulty with graph representations is that their nodes may not be aligned, which further complicates graph comparisons.\nIn this paper, we propose a new framework for graph comparison, which permits to compute both the distance between two graphs under unknown permutations, and the transportation plan for data from one graph to another, under the assumption that the graphs have the same number of vertices. Instead\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nof comparing graph matrices directly, we propose to look at the smooth graph signal distributions associated to each graph, and to relate the distance between graphs to the distance between the graph signal distributions. 
We resort to optimal transport for computing the Wasserstein distance between distributions, as well as the corresponding transportation plan. Optimal transport (OT) was introduced by Monge [3], and reformulated in a more tractable way by Kantorovich [4]. It has been a topic of great interest both theoretically and practically [5], and has recently been largely revisited with new applications in image processing, data analysis, and machine learning [6]. Interestingly, the Wasserstein distance takes a closed-form expression in our settings, which essentially depends on the Laplacian matrices of the graphs under comparison. We further show that the Wasserstein distance has the important advantage of capturing the main structural information of the graphs.\nEquipped with this distance, we formulate a new graph alignment problem for finding the permutation that minimises the mass transportation between a \"fixed\" distribution and a \"permuted\" distribution. This yields a nonconvex optimization problem that we solve efficiently with a novel stochastic gradient descent algorithm. It permits efficient alignment and comparison of graphs, and it outputs a structurally meaningful distance and transport map. These are important elements in graph analysis, comparison, or graph signal prediction tasks. We finally illustrate the benefits of our new graph comparison framework in representative tasks such as noisy graph alignment, graph classification, and graph signal transfer. Our results show that the proposed distance outperforms both the Gromov-Wasserstein and Euclidean distances in graph alignment and graph clustering. In addition, we show the use of transport maps to predict graph signals. 
To the best of our knowledge, this is the only framework for graph comparison that includes the possibility to adapt graph signals to another graph.\n\n1.1 Related work\n\nIn the literature, many methods have formulated graph matching as a quadratic assignment problem [7, 8], under the constraint that the solution is a permutation matrix. As this is an NP-hard problem, different relaxations have been proposed to find approximate solutions. In this context, spectral clustering [9, 10] emerged as a simple relaxation, which consists of finding the orthogonal matrix whose squared entries sum to one, but the drawback is that the matching accuracy is suboptimal. To improve on this behavior, the semi-definite programming relaxation was adopted to tackle the graph matching problem by relaxing the non-convex constraint into a semi-definite one [11]. Spectral properties have also been used to inspect graphs and define different classes of graphs for which the convex relaxation is equivalent to the original graph matching problem [12, 13]. Other works focus on the general problem and propose provably tight convex relaxations for all graph classes [14]. Based on the assumption that the space of doubly-stochastic matrices is a convex hull, the graph matching problem was relaxed into a non-convex quadratic problem in [15, 16]. A related approach was recently proposed to approximate discrete graph matching in the continuous domain asymptotically by using separable functions [17]. Along similar lines, a Gumbel-Sinkhorn network was proposed to infer permutations from data [18, 19]. The approach consists of producing a discrete permutation from a continuous doubly-stochastic matrix obtained with the Sinkhorn operator.\nCloser to our framework, some recent works have studied the graph alignment problem from an optimal transport perspective. For example, Flamary et al. 
[20] propose a method based on optimal transport for empirical distributions with a graph-based regularization. The objective of this work is to compute an optimal transportation plan by controlling the displacement of a pair of points. Graph-based regularization encodes neighborhood similarity between samples on either the final position of the transported samples, or their displacement [21]. Gu et al. [22] define a spectral distance by assigning a probability measure to the nodes via the spectrum representation of each graph, and by using Wasserstein distances between probability measures. This approach however does not take into account the full graph structure in the alignment problem. Nikolentzos et al. [23] proposed instead to match the graph embeddings, where the latter are represented as bags of vectors, and the Wasserstein distance is computed between them. The authors also propose a heuristic to take into account possible node labels or signals.\nAnother line of work has looked at more specific graphs. Mémoli [24] investigates the Gromov-Wasserstein distance for object matching, and Peyré et al. [25] propose an efficient algorithm to compute the Gromov-Wasserstein distance and the barycenter of pairwise dissimilarity matrices. The algorithm uses entropic regularization and Sinkhorn projections, as proposed by [26]. The work has many interesting applications, including multimedia with point-cloud averaging and matching, but also natural language processing with alignment of word embedding spaces [27]. Vayer et al. [28] build on this work and propose a distance for graphs and signals living on them. The problem is given as a combination of the Gromov-Wasserstein distance between graph distance matrices and the Wasserstein distance between graph signals. 
However, while the above methods solve the alignment problem using optimal transport, the simple distances between aligned graphs do not take into account their global structure, and the methods do not consider the transportation of signals between graphs.\n\n1.2 Organization\n\nIn this paper, we propose to resort to smooth graph signal distributions in order to compare graphs, and develop an effective algorithm to align graphs under a priori unknown permutations. The paper is organized as follows. Section 2 details the graph alignment with optimal transport. Section 3 presents the algorithm for solving the proposed approach via a stochastic gradient technique. Section 4 provides an experimental validation of graph matching in the context of graph classification and graph signal transfer. Finally, the conclusion is given in Section 5.\n\n2 Graph Alignment with Optimal Transport\n\nDespite recent advances in the analysis of graph data, it remains quite challenging to define a meaningful distance between graphs. Moreover, a major difficulty with graph representations is the lack of node alignment, which prevents direct quantitative comparisons between graphs. In this section, we propose a new distance based on Optimal Transport (OT) to compare graphs through smooth graph signal distributions. Then, we use this distance to formulate a new graph alignment problem, which aims at finding the permutation matrix that minimizes the distance between graphs.\n\n2.1 Preliminaries\n\nWe denote by G = (V, E) a graph with a set V of N vertices and a set E of edges. The graph is assumed to be connected, undirected, and edge weighted. The adjacency matrix is denoted by W ∈ R^{N×N}. The degree of a vertex i ∈ V, denoted by d(i), is the sum of weights of all the edges incident to i in the graph G. 
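For concreteness, the degree matrix and the Laplacian L = D − W used throughout this section can be sketched in a few lines of NumPy (the 3-node weighted graph is our own toy example, not from the paper):

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal degree matrix."""
    D = np.diag(W.sum(axis=1))  # d(i) = sum of edge weights incident to i
    return D - W

# Weighted triangle graph on 3 nodes
W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 0.5],
              [2.0, 0.5, 0.0]])
L = laplacian(W)
# L is symmetric, its rows sum to zero, and L[i, i] = d(i)
```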
The degree matrix D ∈ R^{N×N} is then defined as\n\nD_{i,j} = d(i) if i = j, and 0 otherwise.   (1)\n\nBased on W and D, the Laplacian matrix of G is\n\nL = D − W.   (2)\n\nMoreover, we consider additional attributes modelled as features on the graph vertices. Assuming that each node is associated to a scalar feature, the graph signal takes the form of a vector in R^N.\n\n2.2 Smooth graph signals\n\nFollowing [29], we interpret graphs as key elements that drive the probability distributions of signals. Specifically, we consider two graphs G1 and G2 with Laplacian matrices L1 and L2, and we consider signals that follow normal distributions with zero mean and L_1^† or L_2^† as covariance matrix [30], namely(1)\n\nν_{G_1} = N(0, L_1^†)   (3)\nμ_{G_2} = N(0, L_2^†).   (4)\n\nThe above formulation means that the graph signal values vary slowly between strongly connected nodes [30]. This assumption is verified for most common graph and network datasets. It is further used in many graph inference algorithms implicitly representing a graph through its smooth signals [31–33]. Furthermore, the smoothness assumption is used as regularization in many graph applications, such as robust principal component analysis [34] and label propagation [35].\n\n(1) Note that † denotes a pseudoinverse operator.\n\nFigure 1: Illustration of the structural differences captured with the Wasserstein distance between graphs defined in (5). Panels: (a) G1; (b) G2, with ‖L1 − L2‖_F = 2.828 and W_2^2(ν_{G_1}, μ_{G_2}) = 0.912; (c) G3, with ‖L1 − L3‖_F = 2.828 and W_2^2(ν_{G_1}, μ_{G_3}) = 0.013. The graphs G2 and G3 are both copies of G1, with 2 edges removed. The modification in G2 is very influential, as the two communities are almost disconnected; here, both the Frobenius norm and the Wasserstein distance measure a significant difference w.r.t. G1. 
Conversely, the modification in G3 is hardly noticeable; here, the Frobenius norm still measures a significant difference, whereas the Wasserstein distance does not. The latter is a desirable property in the context of graph comparison.\n\n2.3 Wasserstein distance between graphs\n\nInstead of comparing graphs directly, we propose to look at the signal distributions, which are governed by the graphs. Specifically, we measure the dissimilarity between two aligned graphs G1 and G2 through the Wasserstein distance of the respective distributions ν_{G_1} and μ_{G_2}. More precisely, the 2-Wasserstein distance corresponds to the minimal “effort” required to transport one probability measure to another with respect to the Euclidean norm [3], that is\n\nW_2^2(ν_{G_1}, μ_{G_2}) = inf_{T#ν_{G_1} = μ_{G_2}} ∫_X ‖x − T(x)‖^2 dν_{G_1}(x),   (5)\n\nwhere T#ν_{G_1} denotes the push forward of ν_{G_1} by the transport map T : X → X defined on a metric space X. For normal distributions such as ν_{G_1} and μ_{G_2}, the 2-Wasserstein distance can be explicitly written in terms of their covariance matrices [36], yielding\n\nW_2^2(ν_{G_1}, μ_{G_2}) = Tr(L_1^† + L_2^†) − 2 Tr((L_2^{†/2} L_1^† L_2^{†/2})^{1/2}),   (6)\n\nand the optimal transportation map is T(x) = L_2^{†/2} (L_2^{†/2} L_1^† L_2^{†/2})^{†/2} L_2^{†/2} x.\n\nThe Wasserstein distance captures the structural information of the graphs under comparison. 
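As a minimal sketch of the closed-form distance in (6) (our own NumPy helpers; the two 3-node graphs are illustrative toy inputs, not from the paper):

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric square root of a positive semi-definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def got_distance(L1, L2):
    """Squared 2-Wasserstein distance between N(0, L1^+) and N(0, L2^+),
    following the closed form in Eq. (6)."""
    S1, S2 = np.linalg.pinv(L1), np.linalg.pinv(L2)
    R2 = sqrtm_psd(S2)  # plays the role of L2^{+/2}
    cross = sqrtm_psd(R2 @ S1 @ R2)
    return np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross)

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

# Path graph vs. triangle graph on 3 nodes
L_path = laplacian(np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]]))
L_tri = laplacian(np.ones((3, 3)) - np.eye(3))
d = got_distance(L_path, L_tri)
# d is non-negative, symmetric in its arguments, and zero for identical graphs
```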
It is sensitive to differences that cause a global change in the connection between graph components, while it gives less importance to differences that have a small impact on the whole graph structure. Indeed, as graphs are represented through the distribution of smooth signals, the Wasserstein distance essentially measures the discrepancy in lower graph frequencies, known to capture the global graph structure. This behaviour is illustrated in Figure 1 by a comparison with a simple distance that is the Euclidean norm between the Laplacian matrices of the graphs.(2)\nMoreover, the optimal transportation map enables the movement of signals from one graph to another. This is a continuous Lipschitz mapping that adapts a graph signal to the distribution of another graph, while keeping similarity. This results in a simple and efficient prediction of the signal on another graph. Clearly, signals that are more likely in the observed distribution will have a more robust transportation, and different Gaussian signal models (in Equations 3 and 4) might be more appropriate for non-smooth signals [37].\n\n2.4 Graph alignment\n\nEquipped with a measure to compare aligned graphs of the same size through signal distributions, we now propose a new formulation of the graph alignment problem. It is important to note that the graph\n\n(2) Note that in our setting a possible alternative to the Wasserstein distance could be the Kullback-Leibler (KL) divergence, whose expression is explicit for normal distributions.\n\nAlgorithm 1 Approximate solution to the graph alignment problem defined in (8).\nRequire: Graphs G1 and G2\nRequire: Sampling size S ∈ N, learning rate γ > 0, and constant τ > 0\nRequire: Random initialization of matrices η_0 and σ_0\n1: for t = 0, 1, . . . 
do\n2:   Draw samples {ε_t^{(s)}}_{1≤s≤S} from the distribution q_unit\n3:   Define the stochastic approximation of the cost function as\n       J_t(η_t, σ_t) = (1/S) Σ_{s=1}^{S} W_2^2(ν_{G_1}, μ^{G_2}_{S_τ(η_t + σ_t ⊙ ε_t^{(s)})})\n4:   g_t ← gradient of J_t evaluated at (η_t, σ_t)\n5:   (η_{t+1}, σ_{t+1}) ← update of (η_t, σ_t) using g_t\n6: return P = S_τ(η*)\n\nsignal distributions depend on the enumeration of nodes chosen to build L1 and L2. While in some cases (e.g., dynamically changing graphs, multilayer graphs, etc.) a consistent enumeration can be trivially chosen for all graphs, it generally leads to the challenging problem of estimating an a priori unknown permutation between graphs. In our approach, we are given two connected graphs G1 and G2, each with N distinct vertices and with different sets of edges. We aim at finding the optimal transportation map T from G1 to G2. However, the vertices of these graphs are not necessarily aligned. In order to take all possible enumerations into account, we define the probability measure of a permuted representation for the graph G2 as\n\nμ_P^{G_2} = N(0, (P^⊤ L_2 P)^†) = N(0, P^⊤ L_2^† P),   (7)\n\nwhere P ∈ R^{N×N} is a permutation matrix. 
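The identity used in (7), (P^⊤ L_2 P)^† = P^⊤ L_2^† P, can be checked numerically on a toy Laplacian (our own example):

```python
import numpy as np

# Laplacian of a small weighted path graph
W = np.array([[0., 1., 0., 0.],
              [1., 0., 2., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 0.]])
L2 = np.diag(W.sum(axis=1)) - W

# A permutation matrix: rows of the identity, shuffled
P = np.eye(4)[[2, 0, 3, 1]]

# Conjugation by a permutation commutes with the pseudo-inverse,
# so the permuted covariance in (7) can use either expression
lhs = np.linalg.pinv(P.T @ L2 @ P)
rhs = P.T @ np.linalg.pinv(L2) @ P
```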
Consequently, our graph alignment problem consists in finding the permutation that minimizes the mass transportation between ν_{G_1} and μ_P^{G_2}, which reads\n\nminimize_{P ∈ R^{N×N}}  W_2^2(ν_{G_1}, μ_P^{G_2})\ns.t.  P ∈ [0, 1]^{N×N},  P 1_N = 1_N,  1_N^⊤ P = 1_N^⊤,  P^⊤ P = I_{N×N},   (8)\n\nwhere 1_N = [1 . . . 1]^⊤ ∈ R^N and I_{N×N} is the N × N identity matrix. According to (3), (6), (7), the above distance boils down to\n\nW_2^2(ν_{G_1}, μ_P^{G_2}) = Tr(L_1^† + P^⊤ L_2^† P) − 2 Tr((L_2^{†/2} P L_1^† P^⊤ L_2^{†/2})^{1/2}).   (9)\n\nThe optimal permutation allows us to compare G1 and G2 when the consistent enumeration of nodes is not available. This is however a non-convex optimization problem that cannot be easily solved with standard tools. In the next section, we present an efficient algorithm to tackle this problem.\n\n3 GOT Algorithm\n\nWe propose to solve the OT-based graph alignment problem described in the previous section via stochastic gradient descent. The latter is summarized in Algorithm 1, and its derivation is presented in the remainder of this section.\n\n3.1 Optimization\n\nThe main difficulty in solving Problem (8) arises from the constraint that P is a permutation matrix, since it leads to a discrete optimization problem with a factorial number of feasible solutions. We propose to circumvent this issue through an implicit constraint reformulation. The idea is that the constraints in (8) can be enforced implicitly by using the Sinkhorn operator [38, 26, 39, 18]. 
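A minimal sketch of this Sinkhorn normalization (our own NumPy version; the fixed iteration count and the τ value are illustrative choices, not the paper's settings):

```python
import numpy as np

def sinkhorn(P, tau=0.2, n_iter=300):
    """Approximate S_tau(P): scale exp(P / tau) to a doubly-stochastic matrix
    by alternating row and column normalizations."""
    M = np.exp((P - P.max()) / tau)  # shift by max for numerical stability
    for _ in range(n_iter):
        M /= M.sum(axis=1, keepdims=True)  # row scaling (diagonal matrix A)
        M /= M.sum(axis=0, keepdims=True)  # column scaling (diagonal matrix B)
    return M

rng = np.random.default_rng(0)
S = sinkhorn(rng.standard_normal((5, 5)))
# rows and columns of S sum to one; as tau -> 0, S approaches a permutation
```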
Given a square matrix P ∈ R^{N×N} (not necessarily a permutation) and a small constant τ > 0, the Sinkhorn operator S_τ normalizes the rows and columns of exp(P/τ) via the multiplication by two diagonal matrices A and B, yielding(3)\n\nS_τ(P) = A exp(P/τ) B.   (10)\n\nThe diagonal matrices A and B are computed iteratively as follows:\n\nA^{[k]} = diag(P^{[k]} 1_N)^{−1}   (11)\nB^{[k]} = diag(1_N^⊤ A^{[k]} P^{[k]})^{−1}   (12)\nP^{[k+1]} = A^{[k]} P^{[k]} B^{[k]},   (13)\n\nwith P^{[0]} = exp(P/τ). It can be shown [18] that the operator S_τ yields a permutation matrix in the limit τ → 0. Consequently, with a slight abuse of notation (as P no longer denotes a permutation), we can rewrite Problem (8) as follows\n\nminimize_{P ∈ R^{N×N}}  W_2^2(ν_{G_1}, μ^{G_2}_{S_τ(P)}).   (14)\n\nThe above cost function is differentiable [40], and can be thus optimized by gradient descent. An illustrative example of a solution of the proposed approach is presented in Fig. 2.\n\nFigure 2: Illustrative example of the graph alignment problem. Panels: (a) Graph 1; (b) Graph 2; (c) solution P̄ to (14); (d) matrix S_τ(P̄). The solution to (14) is a matrix P̄ whose rows may be interpreted as assignment log-likelihoods. Applying the Sinkhorn operator to P̄ yields a matrix whose rows are assignment probabilities from Graph 1 (columns) to Graph 2 (rows).\n\n3.2 Stochastic exploration\n\nProblem (14) is highly nonconvex, which may cause gradient descent to converge towards a local minimum. Hence, instead of directly optimizing the cost function in (14), we can optimize its expectation w.r.t. the parameters θ of some distribution q_θ, yielding\n\nminimize_θ  E_{P~q_θ} { W_2^2(ν_{G_1}, μ^{G_2}_{S_τ(P)}) }.   (15)\n\nThe optimization of the expectation w.r.t. 
the parameters θ aims at shaping the distribution q_θ so as to put all its mass on a minimizer of the original cost function, thus integrating the use of Bayesian exploration in the optimization process.\nA standard choice for q_θ in continuous optimization is the multivariate normal distribution, thus leading to θ = (η, σ) ∈ R^{N×N} × R^{N×N} and q_θ = Π_{i,j} N(η_ij, σ_ij^2). By leveraging the reparameterization trick [41, 42], which boils down to the equivalence\n\n(∀(i, j) ∈ {1, . . . , N}^2)  P_ij ~ N(η_ij, σ_ij^2)  ⇔  { ε_ij ~ N(0, 1),  P_ij = η_ij + σ_ij ε_ij },   (16)\n\nthe above problem can be reformulated as(4)\n\nminimize_{η,σ}  E_{ε~q_unit} { W_2^2(ν_{G_1}, μ^{G_2}_{S_τ(η + σ⊙ε)}) } =: J(η, σ),   (17)\n\nwhere q_unit = Π_{i,j} N(0, 1) denotes the multivariate normal distribution with zero mean and unitary variance.\n\n(3) Note that exp is applied element-wise to ensure the positivity of the matrix entries.\n(4) Note that ⊙ is the entry-wise (Hadamard) product between matrices.\n\nFigure 3: Illustrative example of stochastic exploration. Panels: (a) plot of f(t) = −sinc(t); (b) contours of J(η, σ) = E_{t~N(η,σ^2)}{f(t)}. The white circles mark the iterates (η_0, σ_0), . . . , (η*, σ*) produced by optimizing J (the expectation of f) via stochastic gradient descent. As this optimization is performed in the space of parameters η and σ (see the right panel), the algorithm avoids local minima and successfully converges to the global minimum of both J and f. 
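The reparameterization in (16) can be sketched as follows (NumPy; the matrix size and parameter values are our own toy choices):

```python
import numpy as np

def sample_P(eta, sigma, rng):
    """Draw P with entries P_ij ~ N(eta_ij, sigma_ij^2) via the
    reparameterization trick: parameter-free noise eps ~ N(0, 1) is shifted
    and scaled, so gradients can flow back into eta and sigma."""
    eps = rng.standard_normal(eta.shape)
    return eta + sigma * eps

rng = np.random.default_rng(1)
eta = np.zeros((4, 4))        # means, to be optimized
sigma = np.full((4, 4), 0.5)  # standard deviations, to be optimized
samples = np.stack([sample_P(eta, sigma, rng) for _ in range(2000)])
# the empirical mean approaches eta and the empirical std approaches sigma
```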
The advantage of this reformulation is that the gradient of the above stochastic function can be approximated by sampling from the parameterless distribution q_unit, yielding\n\n∇J(η, σ) ≈ (1/S) Σ_{s=1}^{S} ∇ W_2^2(ν_{G_1}, μ^{G_2}_{S_τ(η + σ⊙ε^{(s)})}),  with ε^{(s)} ~ q_unit.   (18)\n\nThe problem can be thus solved by stochastic gradient descent [43]. An illustrative application of this approach on a simple one-dimensional nonconvex function is presented in Fig. 3. Under mild assumptions, the algorithm converges almost surely to a critical point, which is not guaranteed to be the global minimum, as the problem is nonconvex.\nThe computational complexity of the naive implementation is O(N^3) per iteration, due to the matrix square-root operation based on a singular value decomposition (SVD). A better option consists of approximating the matrix square-root with Newton's method [44]. These iterations only involve matrix multiplications, which can take advantage of the matrix sparsity, thus resulting in a faster implementation than SVD. Moreover, the computation of pseudo-inverses can be avoided by adding a small diagonal shift to the Laplacian matrices and directly computing the inverse matrices, which is orders of magnitude faster. This is not a large concern though, as it can be done in preprocessing and only needs to be done once. Finally, the algorithm was implemented using automatic differentiation (in PyTorch with AMSGrad [45]).\n\n4 Experimental results\n\nWe illustrate the behaviour of our approach, named GOT, in terms of both distance metric computation and transportation map inference. We show how, due to the ability of our distance metric to strongly capture structural properties, it can be beneficial in computing alignment between structured graphs even when they are very different. 
For similar reasons, the metric is able to properly separate instances of random graphs according to their original model. Finally, we show illustrations of the use of transportation maps for signal prediction in simple image classes.\nPrior to running experiments, we chose the parameters τ (Sinkhorn) and γ (learning rate) with grid search, while S (sampling size) was fixed empirically. In all experiments, we set τ = 5, γ = 0.2, and S = 30. We set the maximal number of Sinkhorn iterations to 10, and we run stochastic gradient descent for 3000 iterations (even though the algorithm typically converges long before, after around 1000 iterations). As our algorithm seems robust to different initialisations, we used random initialisation in all our experiments. The code is available at https://github.com/Hermina/GOT.\n\n4.1 Alignment of structured graphs\n\nWe generate a stochastic block model graph with 40 nodes and 4 communities. A noisy version of this graph is created by randomly removing edges within communities with probability p = 0.5, and\n\nFigure 4: Alignment and community detection performance for distorted stochastic block model graphs as a function of the edge removal probability. The first three plots show different error measures (closer to 0 is better); the last one shows the community detection performance in terms of Normalized Mutual Information (NMI; closer to 1 is better).\n\nFigure 5: Confusion matrices for 1-NN classification results on random graph models. Rows represent actual classes, while columns are predicted classes: SBM2, SBM3, RG, BA, WS respectively.\n\nedges between communities with increasing probabilities p ∈ [0, 0.6]. We then generate a random permutation to change the order of nodes in the noisy graph. We investigate the influence of a distance metric on alignment recovery. 
We compare three different methods for graph alignment, namely the proposed method based on the suggested Wasserstein distance between graphs (GOT), the proposed stochastic algorithm with the Euclidean distance (L2), and the state-of-the-art Gromov-Wasserstein distance [25, 28] for graphs (GW), based on the Euclidean distance between shortest path matrices, as proposed in [28]. We repeat each experiment 50 times, after adjusting parameters for all compared methods, and show the results in Figure 4.\nApart from analysing the distance between aligned graphs with all three error measures, we also evaluate the structural recovery of these community-based models by inspecting the normalized mutual information (NMI) for community detection. While GW slightly outperforms GOT in terms of its own error measure, GOT clearly performs better in terms of all other inspected metrics. In particular, the last plot shows that the structural information is well captured by GOT, and communities are successfully recovered even when the graphs contain a large amount of introduced perturbations.\n\n4.2 Graph classification\n\nWe tackle the task of graph classification on random graph models. We create 100 graphs following five different models (20 per model), namely the Stochastic Block Model [46] with 2 blocks (SBM2), the Stochastic Block Model with 3 blocks (SBM3), the random regular graph (RG) [47], the Barabási-Albert model (BA) [48], and the Watts-Strogatz model (WS) [49]. All graphs have 20 nodes and a similar number of edges to make the task more meaningful, and are randomly permuted. We use GOT to align graphs, and eventually use a simple non-parametric 1-NN classification algorithm to classify graphs. We compare to several methods for graph alignment: GW [25, 28], FGM [50], IPFP [51], RRWM [15] and NetLSD [52]. We present the results in terms of confusion matrices in Figure 5, accompanied by their accuracy scores. 
GOT clearly outperforms the other methods in terms of general accuracy, with GW and RRWM also performing well, but having more difficulties with the SBMs and the WS model. This once again suggests that GOT is able to capture the structural information of graphs.\n\n4.3 Graph signal transportation\n\nFinally, we look at the relevance of the transportation plans produced by GOT in illustrative experiments with simple images. We use the MNIST dataset, which contains around 60000 images of size 28×28 displaying handwritten digits from 0 to 9, with 6000 per class. For each class c ∈ {0, . . . , 9}, we stack all the available images into a feature matrix of size 6000 × 784, and we build a graph over the resulting 784 feature vectors. To construct a graph, we first create a 20-nearest-neighbour binary graph, which we then square (multiply with itself) to obtain the final graph, capturing 2-hop distances and creating more meaningful weights. Hence, each class of digits is represented by a graph of 784 nodes (i.e., image pixels), yielding 10 aligned graphs Gzero, Gone, . . . , Gnine.\n\nFigure 6: First two rows: Original “zero” digits in the MNIST dataset, and their images transported to graphs of different digits. The transported digits in each row follow the inclination of the original zero digit. Last two rows: Original “Shirt” images in the Fashion MNIST dataset, and their images transported to the graphs of other classes (“T-shirt”, “Trouser”, “Pullover”, “Dress”, “Coat”, “Sandal”, “Sneaker”, “Bag”, “Ankle boot”).\n\nEach image of a given class can be seen as a smooth signal x ∈ R^784 that lives on the corresponding graph. A transportation plan T is then constructed between the source graph (e.g., Gzero) and all other graphs (e.g., Gone, Gtwo, . . . , Gnine). 
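The pixel-graph construction described above can be sketched as follows (plain NumPy; the random feature matrix stands in for the stacked pixel columns, and k = 5 instead of 20 keeps the toy example small):

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric binary k-nearest-neighbour graph over the rows of X."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    np.fill_diagonal(d, np.inf)            # a node is not its own neighbour
    W = np.zeros_like(d)
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours per node
    W[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)              # symmetrize

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))  # stand-in for the 784 pixel feature vectors
W = knn_graph(X, k=5)
W2hop = W @ W                     # "squared" graph: weights count 2-hop paths
```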
Figure 6 shows two original "zero" digits with different inclinations, transported to the graphs of all other digits. The predicted digits are recognisable, as they adapt to their corresponding graphs while retaining the inclination of the original digit.
We repeated the same experiment on Fashion MNIST, and report the results in Figure 6. By transporting a "Shirt" image to the graphs of the classes "T-shirt", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Sneaker", "Bag" and "Ankle boot", we observe that the predicted images remain recognisable with a good degree of fidelity. Furthermore, the white shirt translates to white clothing items, while the textured shirt leads to textured items. This experiment confirms the potential of GOT in graph signal prediction through the adaptation of a graph signal to another graph.

5 Conclusion

We presented an optimal transport based approach for computing the distance between two graphs and the associated transportation plan. Equipped with this distance, we formulated the problem of finding the permutation between two unaligned graphs, and we proposed to solve it with a novel stochastic gradient descent algorithm. We evaluated the proposed approach in the context of graph alignment, graph classification, and graph signal transportation. Our experiments confirmed that GOT can efficiently capture the structural information of graphs, and that the proposed transportation plan leads to promising results for the transfer of signals from one graph to another.

6 Acknowledgment

Giovanni Chierchia was supported by the CNRS INS2I JCJC project under grant 2019OSCI.

References

[1] I. Jovanović and Z. Stanić. Spectral distances of graphs.
Linear Algebra and its Applications, 436(5):1425–1435, 2012.

[2] R. Gera, L. Alonso, B. Crawford, J. House, J. A. Mendez-Bermudez, T. Knuth, and R. Miller. Identifying network structure similarity using spectral graph theory. Applied Network Science, 3(1):2, January 2018.

[3] G. Monge. Mémoire sur la théorie des déblais et des remblais. De l'Imprimerie Royale, 1781.

[4] L. Kantorovich. On the transfer of masses. Doklady Akademii Nauk USSR, pages 227–229, 1942.

[5] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[6] G. Peyré and M. Cuturi. Computational optimal transport. Preprint arXiv:1803.00567, 2018.

[7] J. Yan, X. Yin, W. Lin, C. Deng, H. Zha, and X. Yang. A short survey of recent advances in graph matching. In International Conference on Multimedia Retrieval, pages 167–174, New York, NY, USA, 2016. ACM.

[8] B. Jiang, J. Tang, C. Ding, Y. Gong, and B. Luo. Graph matching via multiplicative update algorithm. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 3187–3195. Curran Associates, Inc., 2017.

[9] T. Caelli and S. Kosinov. An eigenspace projection clustering method for inexact graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4):515–519, 2004.

[10] P. Srinivasan, T. Cour, and J. Shi. Balanced graph matching. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, pages 313–320. MIT Press, 2007.

[11] C. Schellewald and C. Schnörr. Probabilistic subgraph matching based on convex relaxation. In Anand Rangarajan, Baba Vemuri, and Alan L. Yuille, editors, Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 171–186, Berlin, Heidelberg, 2005.
Springer Berlin Heidelberg.

[12] Yonathan Aflalo, Alexander Bronstein, and Ron Kimmel. On convex relaxation of graph isomorphism. Proceedings of the National Academy of Sciences, 112(10):2942–2947, 2015.

[13] Marcelo Fiori and Guillermo Sapiro. On spectral properties for graph matching and graph isomorphism problems. Information and Inference: A Journal of the IMA, 4(1):63–76, 2015.

[14] Nadav Dym, Haggai Maron, and Yaron Lipman. DS++: A flexible, scalable and provably tight relaxation for matching problems. Preprint arXiv:1705.06148, 2017.

[15] M. Cho, J. Lee, and K. M. Lee. Reweighted random walks for graph matching. In European Conference on Computer Vision, pages 492–505. Springer, 2010.

[16] F. Zhou and F. De la Torre. Factorized graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1774–1789, September 2016.

[17] T. Yu, J. Yan, Y. Wang, W. Liu, and B. Li. Generalizing graph matching beyond quadratic assignment model. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 853–863. Curran Associates, Inc., 2018.

[18] G. Mena, D. Belanger, S. Linderman, and J. Snoek. Learning latent permutations with Gumbel-Sinkhorn networks. In International Conference on Learning Representations, 2018.

[19] P. Emami and S. Ranka. Learning permutations with Sinkhorn policy gradient. Preprint arXiv:1805.07010, 2018.

[20] R. Flamary, N. Courty, A. Rakotomamonjy, and D. Tuia. Optimal transport with Laplacian regularization. In NIPS 2014, Workshop on Optimal Transport and Machine Learning, Montréal, Canada, December 2014.

[21] S. Ferradans, N. Papadakis, J. Rabin, G. Peyré, and J.-F. Aujol. Regularized discrete optimal transport. In A. Kuijper, K. Bredies, T. Pock, and H. Bischof, editors, Scale Space and Variational Methods in Computer Vision, pages 428–439, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[22] J. Gu, B. Hua, and S. Liu. Spectral distances on graphs. Discrete Applied Mathematics, 190-191:56–74, 2015.

[23] G. Nikolentzos, P. Meladianos, and M. Vazirgiannis. Matching node embeddings for graph similarity. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[24] F. Mémoli. Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417–487, 2011.

[25] G. Peyré, M. Cuturi, and J. Solomon. Gromov-Wasserstein averaging of kernel and distance matrices. In Maria Florina Balcan and Kilian Q. Weinberger, editors, International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2664–2672, New York, New York, USA, 20–22 June 2016.

[26] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, pages 2292–2300. Curran Associates, Inc., 2013.

[27] D. Alvarez-Melis and T. S. Jaakkola. Gromov-Wasserstein alignment of word embedding spaces. Preprint arXiv:1809.00013, 2018.

[28] T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Optimal transport for structured data. Preprint arXiv:1805.09114, 2018.

[29] Håvard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. Chapman and Hall/CRC, 2005.

[30] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst. Learning Laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing, 64(23):6160–6173, 2016.

[31] A. P. Dempster. Covariance selection. Biometrics, pages 157–175, 1972.

[32] J. Friedman, T. Hastie, and R.
Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[33] X. Dong, D. Thanou, M. Rabbat, and P. Frossard. Learning graphs from data: A signal representation perspective. Preprint arXiv:1806.00848, 2018.

[34] N. Shahid, V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst. Robust principal component analysis on graphs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2812–2820, 2015.

[35] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In International Conference on Machine Learning, pages 912–919, 2003.

[36] A. Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005–1026, 2011.

[37] S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro. Network topology inference from spectral templates. IEEE Transactions on Signal and Information Processing over Networks, 3(3):467–483, September 2017.

[38] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.

[39] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1608–1617, Playa Blanca, Lanzarote, Canary Islands, 09–11 April 2018.

[40] G. Luise, A. Rudi, M. Pontil, and C. Ciliberto. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 5859–5870. 2018.

[41] D. P. Kingma and M. Welling.
Auto-encoding variational Bayes. Preprint arXiv:1312.6114, 2014.

[42] M. Figurnov, S. Mohamed, and A. Mnih. Implicit reparameterization gradients. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 441–452. Curran Associates, Inc., 2018.

[43] M. E. Khan, W. Lin, V. Tangkaratt, Z. Liu, and D. Nielsen. Variational adaptive-Newton method for explorative learning. Preprint arXiv:1711.05560, 2017.

[44] T.-Y. Lin and S. Maji. Improved bilinear pooling with CNNs. In British Machine Vision Conference, London, UK, September 2017.

[45] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.

[46] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[47] A. Steger and N. C. Wormald. Generating random regular graphs quickly. Combinatorics, Probability and Computing, 8(4):377–396, 1999.

[48] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[49] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440–442, June 1998.

[50] F. Zhou and F. De la Torre. Deformable graph matching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2922–2929, June 2013.

[51] M. Leordeanu, M. Hebert, and R. Sukthankar. An integer projected fixed point method for graph matching and MAP inference. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A.
Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1114–1122. Curran Associates, Inc., 2009.

[52] Anton Tsitsulin, Davide Mottin, Panagiotis Karras, Alexander Bronstein, and Emmanuel Müller. NetLSD: hearing the shape of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2347–2356. ACM, 2018.