{"title": "Bayesian Semi-supervised Learning with Graph Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1683, "page_last": 1694, "abstract": "We propose a data-efficient Gaussian process-based Bayesian approach to the semi-supervised learning problem on graphs. The proposed model shows extremely competitive performance when compared to the state-of-the-art graph neural networks on semi-supervised learning benchmark experiments, and outperforms the neural networks in active learning experiments where labels are scarce. Furthermore, the model does not require a validation data set for early stopping to control over-fitting. Our model can be viewed as an instance of empirical distribution regression weighted locally by network connectivity. We further motivate the intuitive construction of the model with a Bayesian linear model interpretation where the node features are filtered by an operator related to the graph Laplacian. The method can be easily implemented by adapting off-the-shelf scalable variational inference algorithms for Gaussian processes.", "full_text": "Bayesian Semi-supervised Learning with\n\nGraph Gaussian Processes\n\nYin Cheng Ng1, Nicol\u00f2 Colombo1, Ricardo Silva1,2\n\n1Statistical Science, University College London\n\n2The Alan Turing Institute\n\n{y.ng.12, nicolo.colombo, ricardo.silva}@ucl.ac.uk\n\nAbstract\n\nWe propose a data-ef\ufb01cient Gaussian process-based Bayesian approach to the semi-\nsupervised learning problem on graphs. The proposed model shows extremely\ncompetitive performance when compared to the state-of-the-art graph neural net-\nworks on semi-supervised learning benchmark experiments, and outperforms the\nneural networks in active learning experiments where labels are scarce. Further-\nmore, the model does not require a validation data set for early stopping to control\nover-\ufb01tting. 
Our model can be viewed as an instance of empirical distribution\nregression weighted locally by network connectivity. We further motivate the intu-\nitive construction of the model with a Bayesian linear model interpretation where\nthe node features are \ufb01ltered by an operator related to the graph Laplacian. The\nmethod can be easily implemented by adapting off-the-shelf scalable variational\ninference algorithms for Gaussian processes.\n\n1\n\nIntroduction\n\nData sets with network and graph structures that describe the relationships between the data points\n(nodes) are abundant in the real world. Examples of such data sets include friendship graphs on social\nnetworks, citation networks of academic papers, web graphs and many others. The relational graphs\noften provide rich information in addition to the node features that can be exploited to build better\npredictive models of the node labels, which can be costly to collect. In scenarios where there are not\nenough resources to collect suf\ufb01cient labels, it is important to design data-ef\ufb01cient models that can\ngeneralize well with few training labels. The class of learning problems where a relational graph of\nthe data points is available is referred to as graph-based semi-supervised learning in the literature\n[7, 47].\nMany of the successful graph-based semi-supervised learning models are based on graph Laplacian\nregularization or learning embeddings of the nodes. While these models have been widely adopted,\ntheir predictive performance leaves room for improvement. More recently, powerful graph neural\nnetworks that surpass Laplacian and embedding based methods in predictive performance have\nbecome popular. However, neural network models require relatively larger number of labels to\nprevent over-\ufb01tting and work well. 
We discuss the existing models for graph-based semi-supervised\nlearning in detail in Section 4.\nWe propose a new Gaussian process model for graph-based semi-supervised learning problems that\ncan generalize well with few labels, bridging the gap between the simpler models and the more data\nintensive graph neural networks. The proposed model is also competitive with graph neural networks\nin settings where there are suf\ufb01cient labelled data. While posterior inference for the proposed model\nis intractable for classi\ufb01cation problems, scalable variational inducing point approximation method\nfor Gaussian processes can be directly applied to perform inference. Despite the potentially large\nnumber of inducing points that need to be optimized, the model is protected from over-\ufb01tting by the\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fvariational lower bound, and does not require a validation data set for early stopping. We refer to the\nproposed model as the graph Gaussian process (GGP).\n\n2 Background\n\nIn this section, we brie\ufb02y review key concepts in Gaussian processes and the relevant variational\napproximation technique. Additionally, we review the graph Laplacian, which is relevant to the\nalternative view of the model that we describe in Section 3.1. This section also introduces the notation\nused across the paper.\n\n2.1 Gaussian Processes\n\nA Gaussian process f (x) (GP) is an in\ufb01nite collection of random variables, of which any \ufb01nite subset\nis jointly Gaussian distributed. Consequently, a GP is completely speci\ufb01ed by its mean function m(x)\nand covariance kernel function k\u03b8(x, x(cid:48)), where x, x(cid:48) \u2208 X denote the possible inputs that index\nthe GP and \u03b8 is a set of hyper-parameters parameterizing the kernel function. 
We denote the GP as follows:

f(x) ∼ GP(m(x), kθ(x, x′)).   (1)

GPs are widely used as priors on functions in the Bayesian machine learning literature because of their wide support, posterior consistency, tractable posteriors in certain settings and many other good properties. Combined with a suitable likelihood function as specified in Equation 2, one can construct a regression or classification model that probabilistically accounts for uncertainty and controls over-fitting through Bayesian smoothing. However, if the likelihood is non-Gaussian, such as in the case of classification, inferring the posterior process is analytically intractable and requires approximations. The GP is connected to the observed data via the likelihood function

yn | f(xn) ∼ p(yn | f(xn)),   ∀n ∈ {1, . . . , N}.   (2)

The positive definite kernel function kθ(x, x′) : X × X → R is a key component of the GP that specifies the covariance of f(x) a priori. While kθ(x, x′) is typically directly specified, any kernel function can be expressed as the inner product of feature maps ⟨φ(x), φ(x′)⟩H in the Hilbert space H. The dependency of the feature map on θ is implicitly assumed for conciseness. The feature map φ(x) : X → H projects x into a typically high-dimensional (possibly infinite) feature space such that linear models in the feature space can model the target variable y effectively. Therefore, the GP can equivalently be formulated as

f(x) = φ(x)ᵀw,   (3)

where w is assigned a multivariate Gaussian prior distribution and marginalized. In this paper, we assume the index set to be X = R^(D×1) without loss of generality.
For a detailed review of the GP and of kernel functions, please refer to [45].

2.1.1 Scalable Variational Inference for GP

Despite the flexibility of the GP prior, there are two major drawbacks that plague the model. First, if the likelihood function in Equation 2 is non-Gaussian, posterior inference cannot be computed analytically. Secondly, the computational complexity of the inference algorithm is O(N³), where N is the number of training data points, rendering the model inapplicable to large data sets.
Fortunately, modern variational inference provides a solution to both problems by introducing a set of M inducing points Z = [z1, . . . , zM]ᵀ, where zm ∈ R^(D×1). The inducing points, which are variational parameters, index a set of random variables u = [f(z1), . . . , f(zM)]ᵀ that is a subset of the GP function f(x). Through conditioning, and assuming m(x) is zero, the conditional GP can be expressed as

f(x) | u ∼ GP(kzxᵀ Kzz⁻¹ u, kθ(x, x) − kzxᵀ Kzz⁻¹ kzx),   (4)

where kzx = [kθ(z1, x), . . . , kθ(zM, x)] and [Kzz]ij = kθ(zi, zj). Naturally, p(u) = N(0, Kzz). The variational posterior distribution of u, q(u), is assumed to be a multivariate Gaussian distribution with mean m and covariance matrix S. Following the standard derivation of variational inference, the Evidence Lower Bound (ELBO) objective function is

L(θ, Z, m, S) = Σ_{n=1}^{N} E_{q(f(xn))}[log p(yn | f(xn))] − KL[q(u) || p(u)].   (5)

The variational distribution q(f(xn)) can be easily derived from the conditional GP in Equation 4 and q(u), and its expectation can be approximated effectively using one-dimensional quadrature. We refer the readers to [30] for detailed derivations and results.

2.2 The Graph Laplacian

Given the adjacency matrix A ∈ {0, 1}^(N×N) of an undirected binary graph G = (V, E) without self-loops, the corresponding graph Laplacian is defined as

L = D − A,   (6)

where D is the N × N diagonal node degree matrix. The graph Laplacian can be viewed as an operator on the space of functions g : V → R indexed by the graph's nodes such that

Lg(n) = Σ_{v∈Ne(n)} [g(n) − g(v)],   (7)

where Ne(n) is the set containing the neighbours of node n. Intuitively, applying the Laplacian operator to the function g results in a function that quantifies the variability of g around the nodes in the graph. The Laplacian's spectrum encodes the geometric properties of the graph that are useful in crafting graph filters and kernels [37, 43, 4, 9]. As the Laplacian matrix is real, symmetric and diagonalizable, its eigendecomposition exists. We denote the decomposition as

L = UΛUᵀ,   (8)

where the columns of U ∈ R^(N×N) are the eigenvectors of L and the diagonal Λ ∈ R^(N×N) contains the corresponding eigenvalues. Therefore, the Laplacian operator can also be viewed as a filter on the function g re-expressed in the eigenvector basis. Regularization can be achieved by directly manipulating the eigenvalues of the system [39]. We refer the readers to [4, 37, 9] for comprehensive reviews of the graph Laplacian and its spectrum.

3 Graph Gaussian Processes

Given a data set of size N with D-dimensional features X = [x1, . . . , xN]ᵀ, a symmetric binary adjacency matrix A ∈ {0, 1}^(N×N) that represents the relational graph of the data points, and labels for a subset of the data points, YO = [y1, . . . , yO] with each yi ∈ {1, . . . , K}, we seek to predict the unobserved labels of the remaining data points YU = [yO+1, . . . , yN]. We denote the set of all labels as Y = YO ∪ YU.
The GGP specifies the conditional distribution pθ(Y | X, A), and predicts YU via the predictive distribution pθ(YU | YO, X, A). 
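The graph quantities that the model is built from (A, D, L and the Laplacian's spectrum, Equations 6 to 8) are mechanical to compute; the following minimal NumPy sketch (the toy graph and variable names are ours, not from the paper) illustrates them:

```python
import numpy as np

# Toy undirected graph without self-loops: edges 0-1, 1-2, 1-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]])
D = np.diag(A.sum(axis=1))   # diagonal node degree matrix
L = D - A                    # graph Laplacian (Equation 6)

# L acts as the operator of Equation 7: (Lg)(n) = sum_{v in Ne(n)} [g(n) - g(v)].
g = np.array([1.0, 2.0, 4.0, 8.0])
Lg = L @ g
# Check node 1 by hand: neighbours {0, 2, 3} give (2-1) + (2-4) + (2-8) = -7.
assert Lg[1] == -7.0

# Eigendecomposition L = U Lambda U^T (Equation 8): the eigenvalues are
# non-negative and the constant function lies in the null space of L.
lam, U = np.linalg.eigh(L)
assert np.allclose(L, U @ np.diag(lam) @ U.T)
assert lam.min() > -1e-10
assert np.allclose(L @ np.ones(4), 0.0)
```

The last two assertions are the spectral facts the filtering view of Section 3.1 relies on.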
The joint model is specified as the product of the conditionally independent likelihood p(yn | hn) and the GGP prior pθ(h | X, A) with hyper-parameters θ. The latent likelihood parameter vector h ∈ R^(N×1) is defined in the next paragraph.
First, the model factorizes as

pθ(Y, h | X, A) = pθ(h | X, A) Π_{n=1}^{N} p(yn | hn),   (9)

where, for the multi-class classification problem that we are interested in, p(yn | hn) is given by the robust-max likelihood [30, 16, 23, 21, 20].
Next, we construct the GGP prior from a Gaussian process distributed latent function f(x) : R^(D×1) → R, f(x) ∼ GP(0, kθ(x, x′)), where the key assumption is that the likelihood parameter hn for data point n is an average of the values of f over its 1-hop neighbourhood Ne(n) as given by A:

hn = ( f(xn) + Σ_{l∈Ne(n)} f(xl) ) / (1 + Dn),   (10)

where Ne(n) = {l : l ∈ {1, . . . , N}, Anl = 1} and Dn = |Ne(n)|. We further motivate this key assumption in Section 3.1.

As f(x) has a zero mean function, the GGP prior can be succinctly expressed as a multivariate Gaussian random field

pθ(h | X, A) = N(0, P KXX Pᵀ),   (11)

where P = (I + D)⁻¹(I + A) and [KXX]ij = kθ(xi, xj). A suitable kernel function kθ(xi, xj) for the task at hand can be chosen from the suite of well-studied existing kernels, such as those described in [13]. We refer to the chosen kernel function as the base kernel of the GGP. The P matrix is sometimes known as the random-walk matrix in the literature [9]. A graphical model representation of the proposed model is shown in Figure 1.

Figure 1: The figure depicts a relational graph (left) and the corresponding GGP represented as a graphical model (right). The thick circle represents a set of fully connected nodes.

The covariance structure specified in Equation 11 is equivalent to the pairwise covariance

Cov(hm, hn) = (1 / ((1 + Dm)(1 + Dn))) Σ_{i∈{m}∪Ne(m)} Σ_{j∈{n}∪Ne(n)} kθ(xi, xj)
            = ⟨ (1/(1 + Dm)) Σ_{i∈{m}∪Ne(m)} φ(xi), (1/(1 + Dn)) Σ_{j∈{n}∪Ne(n)} φ(xj) ⟩H,   (12)

where φ(·) is the feature map that corresponds to the base kernel kθ(·, ·). Equation 12 can be viewed as the inner product between the empirical kernel mean embeddings that correspond to the bags of node features observed in the 1-hop neighbourhood sub-graphs of nodes m and n, relating the proposed model to the Gaussian process distribution regression model presented in e.g. [15].
More specifically, we can view the GGP as a distribution classification model for the labelled bags of node features {({xi | i ∈ {n} ∪ Ne(n)}, yn)}_{n=1}^{O}, such that the unobserved distribution Pn that generates {xi | i ∈ {n} ∪ Ne(n)} is summarized by its empirical kernel mean embedding

μ̂n = (1/(1 + Dn)) Σ_{j∈{n}∪Ne(n)} φ(xj).   (13)

The prior on h can equivalently be expressed as h ∼ GP(0, ⟨μ̂m, μ̂n⟩H). For detailed reviews of kernel mean embeddings and distribution regression models, we refer the readers to [32] and [41] respectively.
One main assumption of the 1-hop neighbourhood averaging mechanism is homophily, i.e., nodes with similar covariates are more likely to form connections with each other [17]. The assumption allows us to approximately treat the node covariates from a 1-hop neighbourhood as samples drawn from the same data distribution, in order to model them using distribution regression. 
While it is perfectly reasonable to consider multi-hop neighbourhood averaging, the homophily assumption starts to break down if we consider 2-hop neighbours which are not directly connected. Nevertheless, it is interesting to explore non-naive ways to account for multi-hop neighbours in the future, such as stacking 1-hop averaging graph GPs in a structure similar to that of the deep Gaussian processes [10, 34], or having multiple latent GPs for neighbours of different hops that are summed up in the likelihood functions.

3.1 An Alternative View of GGP

In this section, we present an alternative formulation of the GGP, which results in an intuitive interpretation of the model. The alternative formulation views the GGP as a Bayesian linear model on feature maps of the nodes that have been transformed by a function related to the graph Laplacian L.
As we reviewed in Section 2.1, the kernel matrix KXX in Equation 11 can be written as the product of feature map matrices ΦX ΦXᵀ, where row n of ΦX corresponds to the feature map of node n, φ(xn) = [φn1, . . . , φnQ]ᵀ. Therefore, the covariance matrix in Equation 11, P ΦX ΦXᵀ Pᵀ, can be viewed as the product of the transformed feature maps

Φ̂X = P ΦX = (I + D)⁻¹ D ΦX + (I + D)⁻¹ (I − L) ΦX,   (14)

where L is the graph Laplacian matrix as defined in Equation 6. Isolating the transformed feature map for node n (i.e., row n of Φ̂X) gives

φ̂(xn) = (Dn / (1 + Dn)) φ(xn) + (1 / (1 + Dn)) [(I − L) ΦX]nᵀ,   (15)

where Dn is the degree of node n and [·]n denotes row n of a matrix. The proposed GGP model is equivalent to a supervised Bayesian linear classification model with a feature pre-processing step that follows from the expression in Equation 15. 
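The decomposition in Equations 14 and 15 is easy to check numerically. The NumPy sketch below (our own illustration, with a random matrix standing in for the feature map matrix ΦX) verifies that PΦX splits into the two stated terms, and that an isolated node's features pass through unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
# 4 nodes: edges 0-1 and 1-2; node 3 is isolated.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]])
D = np.diag(A.sum(axis=1))
L = D - A
I = np.eye(4)

P = np.linalg.inv(I + D) @ (I + A)   # 1-hop averaging matrix of Equation 11
Phi = rng.normal(size=(4, 5))        # stand-in for the feature map matrix

# Equation 14: P Phi = (I+D)^{-1} D Phi + (I+D)^{-1} (I - L) Phi,
# which follows from I + A = D + (I - L).
lhs = P @ Phi
rhs = np.linalg.inv(I + D) @ D @ Phi + np.linalg.inv(I + D) @ (I - L) @ Phi
assert np.allclose(lhs, rhs)

# Isolated node (D_3 = 0): the transformed features equal the originals,
# as noted after Equation 15.
assert np.allclose(lhs[3], Phi[3])
```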
For isolated nodes (Dn = 0), the expression in\nEquation 15 leaves the node feature maps unchanged ( \u02c6\u03c6 = \u03c6).\nThe (I \u2212 L) term in Equation 15 can be viewed as a spectral \ufb01lter U(I \u2212 \u039b)UT, where U and\n\u039b are the eigenmatrix and eigenvalues of the Laplacian as de\ufb01ned in Section 2.2. For connected\nnodes, the expression results in new features that are weighted averages of the original features and\nfeatures transformed by the spectral \ufb01lter. The alternative formulation opens up opportunities to\ndesign other spectral \ufb01lters with different regularization properties, such as those described in [39],\nthat can replace the (I \u2212 L) expression in Equation 15. We leave the exploration of this research\ndirection to future work.\nIn addition, it is well-known that many graphs and networks observed in the real world follow the\npower-law node degree distributions [17], implying that there are a handful of nodes with very large\ndegrees (known as hubs) and many with relatively small numbers of connections. The nodes with\nfew connections (small Dn) are likely to be connected to one of the handful of heavily connected\nnodes, and their transformed node feature maps are highly in\ufb02uenced by the features of the hub nodes.\nOn the other hand, individual neighbours of the hub nodes have relatively small impact on the hub\nnodes because of the large number of neighbours that the hubs are connected to. 
This highlights the asymmetric, outsized influence of hubs in the proposed GGP model: a mis-labelled hub node may result in a more significant drop in the model's accuracy than a mis-labelled node with a much lower degree of connections.

3.2 Variational Inference with Inducing Points

Posterior inference for the GGP is analytically intractable because of the non-conjugate likelihood. We approximate the posterior of the GGP using a variational inference algorithm with inducing points, similar to the inter-domain inference algorithm presented in [42]. Implementing the GGP with its variational inference algorithm amounts to implementing a new kernel function that follows Equation 12 in the GPflow Python package.¹
We introduce a set of M inducing random variables u = [f(z1), . . . , f(zM)]ᵀ indexed by inducing points {zm}_{m=1}^{M} in the same domain as the GP function f(x) ∼ GP(0, kθ(x, x′)). As a result, the inter-domain covariance between hn and f(zm) is

Cov(hn, f(zm)) = (1 / (Dn + 1)) [ kθ(xn, zm) + Σ_{l∈Ne(n)} kθ(xl, zm) ].   (16)

Additionally, we introduce a multivariate Gaussian variational distribution q(u) = N(m, SSᵀ) for the inducing random variables, with variational parameters m ∈ R^(M×1) and the lower triangular S ∈ R^(M×M). Through Gaussian conditioning, q(u) yields the variational Gaussian distribution q(h) that is of our interest. The variational parameters m, S, {zm}_{m=1}^{M} and the kernel hyper-parameters θ are then jointly fitted by maximizing the ELBO function in Equation 5.

¹https://github.com/markvdw/GPflow-inter-domain

3.3 Computational Complexity

The computational complexity of the inference algorithm is O(|YO| M²). In the experiments, we chose M to be the number of labelled nodes in the graph, |YO|, which is small relative to the total number of nodes. 
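The GGP covariance that this cost analysis refers to can be checked end to end in a few lines. This is not the paper's GPflow implementation; it is a NumPy sketch with an RBF base kernel (our choice) showing that the matrix form P KXX Pᵀ of Equation 11 agrees with the double-sum, mean-embedding form of Equation 12:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Dim = 5, 3
X = rng.normal(size=(N, Dim))          # node features
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]])        # symmetric adjacency matrix

def rbf(Xa, Xb, lengthscale=1.0):
    # Squared-exponential base kernel k_theta(x, x') (our choice of base kernel).
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

K = rbf(X, X)
deg = A.sum(axis=1)
P = np.linalg.inv(np.eye(N) + np.diag(deg)) @ (np.eye(N) + A)

C = P @ K @ P.T                        # GGP prior covariance (Equation 11)

# Equation 12: entry (m, n) is the double sum of the base kernel over the
# closed 1-hop neighbourhoods of m and n, i.e. an inner product of
# empirical kernel mean embeddings.
m, n = 0, 3
Im = [m] + list(np.flatnonzero(A[m]))
In = [n] + list(np.flatnonzero(A[n]))
direct = K[np.ix_(Im, In)].sum() / ((1 + deg[m]) * (1 + deg[n]))
assert np.isclose(C[m, n], direct)
```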
Computing the covariance function in Equation 12 incurs a computational cost of O(D2\nmax)\nper labelled node, where Dmax is the maximum node degree. In practice, the computational cost\nof computing the covariance function is small because of the sparse property of graphs typically\nobserved in the real-world [17].\n\n4 Related Work\n\nGraph-based learning problems have been studied extensively by researchers from both machine\nlearning and signal processing communities, leading to many models and algorithms that are well-\nsummarized in review papers [4, 35, 37].\nGaussian process-based models that operate on graphs have previously been developed in the closely\nrelated relational learning discipline, resulting in the mixed graph Gaussian process (XGP) [38] and\nrelational Gaussian process (RGP) [8]. Additionally, the renowned Label Propagation (LP)[48] model\ncan also be viewed as a GP with its covariance structure speci\ufb01ed by the graph Laplacian matrix [49].\nThe GGP differs from the previously proposed GP models in that the local neighbourhood structures\nof the graph and the node features are directly used in the speci\ufb01cation of the covariance function,\nresulting in a simple model that is highly effective.\nModels based on Laplacian regularization that restrict the node labels to vary smoothly over graphs\nhave also been proposed previously. The LP model can be viewed as an instance under this framework.\nOther Laplacian regularization based models include the deep semi-supervised embedding [44] and\nthe manifold regularization [3] models. As shown in the experimental results in Table 1, the predictive\nperformance of these models fall short of other more sophisticated models.\nAdditionally, models that extract embeddings of nodes and local sub-graphs which can be used for\npredictions have also been proposed by multiple authors. These models include DeepWalk [33],\nnode2vec [19], planetoid [46] and many others. 
The proposed GGP is related to the embedding\nbased models in that it can be viewed as a GP classifer that takes empirical kernel mean embeddings\nextracted from the 1-hop neighbourhood sub-graphs as inputs to predict node labels.\nFinally, many geometric deep learning models that operate on graphs have been proposed and\nshown to be successful in graph-based semi-supervised learning problems. The earlier models\nincluding [26, 36, 18] are inspired by the recurrent neural networks. On the other hand, convolution\nneural networks that learn convolutional \ufb01lters in the graph Laplacian spectral domain have been\ndemonstrated to perform well. These models include the spectral CNN [5], DCNN [1], ChebNet\n[12] and GCN [25]. Neural networks that operate on the graph spectral domain are limited by the\ngraph-speci\ufb01c Fourier basis. The more recently proposed MoNet [31] addressed the graph-speci\ufb01c\nlimitation of spectral graph neural networks. The idea of \ufb01ltering in graph spectral domain is a\npowerful one that has also been explored in the kernel literatures [39, 43]. We draw parallels between\nour proposed model and the spectral \ufb01ltering approaches in Section 3.1, where we view the GGP as a\nstandard GP classi\ufb01er operating on feature maps that have been transformed through a \ufb01lter that can\nbe related to the graph spectral domain.\nOur work has also been inspired by literatures in Gaussian processes that mix GPs via an additive\nfunction, such as [6, 14, 42].\n\n6\n\n\f5 Experiments\n\nWe present two sets of experiments to benchmark the predictive performance of the GGP against\nexisting models under two different settings. In Section 5.1, we demonstrate that the GGP is a viable\nand extremely competitive alternative to the graph convolutional neural network (GCN) in settings\nwhere there are suf\ufb01cient labelled data points. 
In Section 5.2, we test the models in an active learning\nexperimental setup, and show that the GGP outperforms the baseline models when there are few\ntraining labels.\n\n5.1 Semi-supervised Classi\ufb01cation on Graphs\n\nThe semi-supervised classi\ufb01cation experiments in this section exactly replicate the experimental setup\nin [25], where the GCN is known to perform well. The three benchmark data sets, as described in\nTable 2, are citation networks with bag-of-words (BOW) features, and the prediction targets are the\ntopics of the scienti\ufb01c papers in the citation networks.\nThe experimental results are presented in Table 1, and show that the predictive performance of the\nproposed GGP is competitive with the GCN and MoNet [31] (another deep learning model), and\nsuperior to the other baseline models. While the GCN outperforms the proposed model by small\nmargins on the test sets with 1, 000 data points, it is important to note that the GCN had access to 500\nadditional labelled data points for early stopping. As the GGP does not require early stopping, the\nadditional labelled data points can instead be directly used to train the model to signi\ufb01cantly improve\nthe predictive performance. To demonstrate this advantage, we report another set of results for a GGP\ntrained using the 500 additional data points in Table 1, in the row labelled as \u2018GGP-X\u2019. The boost in\nthe predictive performances shows that the GGP can better exploit the available labelled data to make\npredictions.\nThe GGP base kernel of choice is the 3rd degree polynomial kernel, which is known to work well\nwith high-dimensional BOW features [45]. We re-weighed the BOW features using the popular term\nfrequency-inverse document frequency (TFIDF) technique [40]. The variational parameters and the\nhyper-parameters were jointly optimized using the ADAM optimizer [24]. 
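The feature pipeline just described, TFIDF re-weighted bag-of-words counts fed to a 3rd-degree polynomial base kernel, can be sketched with NumPy alone. The smoothed IDF variant below is our assumption, as the paper does not spell out which TFIDF formula it uses:

```python
import numpy as np

# Toy bag-of-words counts: 3 documents (nodes) x 4 vocabulary terms.
counts = np.array([[2, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 3]], dtype=float)

# TFIDF re-weighting (smoothed IDF; an assumed variant).
tf = counts / counts.sum(axis=1, keepdims=True)
df = (counts > 0).sum(axis=0)
idf = np.log((1 + counts.shape[0]) / (1 + df)) + 1.0
X = tf * idf

def poly3(Xa, Xb, c=1.0):
    # 3rd-degree polynomial kernel k(x, x') = (x . x' + c)^3
    return (Xa @ Xb.T + c) ** 3

K = poly3(X, X)
# A valid kernel matrix is symmetric positive semi-definite.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)
```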
The baseline models that we compared to are the ones that were also presented and compared to in [25] and [31].

                 Cora    Citeseer  Pubmed
GGP              80.9%   69.7%     77.1%
GGP-X            84.7%   75.6%     82.4%
GCN [25]         81.5%   70.3%     79.0%
DCNN [1]         76.8%   -         73.0%
MoNet [31]       81.7%   -         78.8%
DeepWalk [33]    67.2%   43.2%     65.3%
Planetoid [46]   75.7%   64.7%     77.2%
ICA [27]         75.1%   69.1%     73.9%
LP [48]          68.0%   45.3%     63.0%
SemiEmb [44]     59.0%   59.6%     71.1%
ManiReg [3]      59.5%   60.1%     70.7%

Table 1: This table shows the test classification accuracies of the semi-supervised learning experiments described in Section 5.1. The test sets consist of 1,000 data points. The GGP accuracies are averaged over 10 random restarts. The results for DCNN and MoNet are copied from [31], while the results for the other models are from [25]. Please refer to Section 5.1 for discussions of the results.

           Type      N_nodes  N_edges  N_label_cat.  D_features  Label Rate
Cora       Citation  2,708    5,429    7             1,433       0.052
Citeseer   Citation  3,327    4,732    6             3,703       0.036
Pubmed     Citation  19,717   44,338   3             500         0.003

Table 2: A summary of the benchmark data sets for the semi-supervised classification experiment.

5.2 Active Learning on Graphs

Active learning is a domain that faces the same challenges as semi-supervised learning, where labels are scarce and expensive to obtain [47]. In active learning, a subset of unlabelled data points is selected sequentially to be queried according to an acquisition function, with the goal of maximizing the accuracy of the predictive model using significantly fewer labels than would be required if the labelled set were sampled uniformly at random [2]. A motivating example of this problem scenario is in the medical setting, where the time of human experts is precious and the machines must aim to make the best use of the time. 
Therefore, having a data ef\ufb01cient predictive model that can generalize\nwell with few labels is of critical importance in addition to having a good acquisition function.\nIn this section, we leverage GGP as the semi-supervised classi\ufb01cation model of active learner in\ngraph-based active learning problem [47, 28, 11, 22, 29]. The GGP is paired with the proven \u03a3-\noptimal (SOPT) acquisition function to form an active learner [28]. The SOPT acquisition function is\nmodel agnostic in that it only requires the Laplacian matrix of the observed graph and the indices\nof the labelled nodes in order to identify the next node to query, such that the predictive accuracy\nof the active learner is maximally increased. The main goal of the active learning experiments is to\ndemonstrate that the GGP can learn better than both the GCN and the Label Propagation model (LP)\n[48] with very few labelled data points.\nStarting with only 1 randomly selected labelled data point (i.e., node), the active learner identi\ufb01es the\nnext data point to be labelled using the acquisition function. Once the label of the said data point\nis acquired, the classi\ufb01cation model is retrained and its test accuracy is evaluated on the remaining\nunlabelled data points. In our experiments, the process is repeated until 50 labels are acquired. The\nexperiments are also repeated with 10 different initial labelled data points. In addition to the SOPT\nacquisition function, we show the results of the same models paired with the random acquisition\nfunction (RAND) for comparisons.\nThe test accuracies with different numbers of labelled data points are presented as learning curves in\nFigure 2. In addition, we summarize the results numerically using the Area under the Learning Curve\n(ALC) metric in Table 3. The ALC is normalized to have a maximum value of 1, which corresponds\nto a hypothetical learner that can achieve 100% test accuracy with only 1 label. 
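For concreteness, one way to compute such a normalized ALC from a learning curve is a simple average of the test accuracies over the query steps (under equally spaced label counts the area ratio reduces to the mean; the paper's exact integration rule is not stated, so this is an assumed reading):

```python
import numpy as np

# Toy learning curve: test accuracy after acquiring 1, 2, ..., 50 labels.
acc = np.linspace(0.3, 0.8, 50)

# Normalized ALC: area under the curve divided by the area of a hypothetical
# learner that achieves 100% test accuracy from the first label onward.
alc = acc.mean()
assert 0.0 <= alc <= 1.0
assert np.isclose(alc, 0.55)
```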
The results show that the proposed GGP model is indeed more data efficient than the baselines, and can outperform both the GCN and the LP models when labelled data are scarce.
The benchmark data sets for the active learning experiments are the Cora and Citeseer data sets. However, due to a technical restriction imposed by the SOPT acquisition function, only the largest connected sub-graph of each data set is used. The restriction reduces the number of nodes in the Cora and Citeseer data sets to 2,485 and 2,120 respectively. Both of the data sets were also used as benchmark data sets in [28].
We pre-process the BOW features with TFIDF and apply a linear kernel as the base kernel of the GGP. All parameters are jointly optimized using the ADAM optimizer. The GCN and LP models are trained using the settings recommended in [25] and [28] respectively.

             Cora             Citeseer
SOPT-GGP     0.733 ± 0.001    0.678 ± 0.002
SOPT-GCN     0.706 ± 0.001    0.675 ± 0.002
SOPT-LP      0.672 ± 0.001    0.638 ± 0.001
RAND-GGP     0.575 ± 0.007    0.557 ± 0.008
RAND-GCN     0.584 ± 0.011    0.533 ± 0.008
RAND-LP      0.424 ± 0.020    0.490 ± 0.011

Table 3: This table shows the Area under the Learning Curve (ALC) scores for the active learning experiments. ALC refers to the area under the learning curves shown in Figure 2, normalized to have a maximum value of 1. The ALCs are computed by averaging over 10 different initial data points. The results show that the GGP is able to generalize better with fewer labels compared to the baselines. 'SOPT' and 'RAND' refer to the acquisition functions used. 
Please refer to Section 5.2 for\ndiscussions of the results.\n\n8\n\n\fFigure 2: The sub-\ufb01gures show the test accuracies from the active learning experiments (y-axis) for\nthe Cora (left) and Citeseer (right) data sets with different number of labelled data points (x-axis).\nThe results are averaged over 10 trials with different initial data points. SOPT and RAND refer to the\nacquisition functions described in Section 5.2. The smaller error bars of \u2018RAND-GGP\u2019 compared\nto those of \u2018RAND-GCN\u2019 demonstrate the relative robustness of the GGP models under random\nshuf\ufb02ing of data points in the training data set. The tiny error bars of the \u2018SOPT-*\u2019 results show that\nthe \u2018SOPT\u2019 acquisition function is insensitive to the randomly selected initial labelled data point.\nPlease also refer to Table 3 for numerical summaries of the results.\n\n6 Conclusion\n\nWe propose a Gaussian process model that is data-ef\ufb01cient for semi-supervised learning problems\non graphs. In the experiments, we show that the proposed model is competitive with the state-of-\nthe-art deep learning models, and outperforms when the number of labels is small. The proposed\nmodel is simple, effective and can leverage modern scalable variational inference algorithm for GP\nwith minimal modi\ufb01cation. In addition, the construction of our model is motivated by distribution\nregression using the empirical kernel mean embeddings, and can also be viewed under the framework\nof \ufb01ltering in the graph spectrum. The spectral view offers a new potential research direction that can\nbe explored in future work.\n\nAcknowledgements\n\nThis work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1.\n\nReferences\n[1] James Atwood and Don Towsley. Diffusion-convolutional neural networks. 
In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.

[2] Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010.

[3] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.

[4] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[5] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

[6] Byron M Yu, John P Cunningham, Gopal Santhanam, Stephen I Ryu, Krishna V Shenoy, and Maneesh Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. In Advances in Neural Information Processing Systems, pages 1881–1888, 2009.

[7] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, 2006.

[8] Wei Chu, Vikas Sindhwani, Zoubin Ghahramani, and S Sathiya Keerthi. Relational learning with Gaussian processes. In Advances in Neural Information Processing Systems, pages 289–296, 2007.

[9] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.

[10] Andreas Damianou and Neil Lawrence. Deep Gaussian processes.
In Artificial Intelligence and Statistics, pages 207–215, 2013.

[11] Gautam Dasarathy, Robert Nowak, and Xiaojin Zhu. S2: An efficient graph based active learning algorithm with application to nonparametric classification. In Conference on Learning Theory, pages 503–522, 2015.

[12] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[13] David Duvenaud. Automatic model construction with Gaussian processes. PhD thesis, University of Cambridge, 2014.

[14] David K Duvenaud, Hannes Nickisch, and Carl E Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems, pages 226–234, 2011.

[15] Seth R Flaxman, Yu-Xiang Wang, and Alexander J Smola. Who supported Obama in 2012? Ecological inference through distribution regression. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 289–298. ACM, 2015.

[16] Mark Girolami and Simon Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790–1817, 2006.

[17] Anna Goldenberg, Alice X Zheng, Stephen E Fienberg, and Edoardo M Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129–233, 2010.

[18] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN'05), volume 2, pages 729–734. IEEE, 2005.

[19] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864.
ACM, 2016.

[20] James Hensman, Alexander G Matthews, Maurizio Filippone, and Zoubin Ghahramani. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015.

[21] Daniel Hernández-Lobato, José M Hernández-Lobato, and Pierre Dupont. Robust multi-class Gaussian process classification. In Advances in Neural Information Processing Systems, pages 280–288, 2011.

[22] Kwang-Sung Jun and Robert Nowak. Graph-based active learning: A new look at expected error minimization. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 1325–1329. IEEE, 2016.

[23] Hyun-Chul Kim and Zoubin Ghahramani. Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1948–1959, 2006.

[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[25] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[26] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[27] Qing Lu and Lise Getoor. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 496–503, 2003.

[28] Yifei Ma, Roman Garnett, and Jeff Schneider. σ-optimality for active learning on Gaussian random fields. In Advances in Neural Information Processing Systems, pages 2751–2759, 2013.

[29] Oisin Mac Aodha, Neill D.F. Campbell, Jan Kautz, and Gabriel J. Brostow. Hierarchical subquery evaluation for active learning on a graph. In CVPR, 2014.

[30] A Matthews. Scalable Gaussian process inference using variational methods. PhD thesis,
University of Cambridge, 2016.

[31] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 5425–5434. IEEE Computer Society, 2017.

[32] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

[33] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[34] Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4588–4599, 2017.

[35] Aliaksei Sandryhaila and Jose MF Moura. Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure. IEEE Signal Processing Magazine, 31(5):80–90, 2014.

[36] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

[37] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

[38] Ricardo Silva, Wei Chu, and Zoubin Ghahramani. Hidden common cause relations in relational learning.
In Advances in Neural Information Processing Systems, pages 1345–1352, 2008.

[39] Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines, pages 144–158. Springer, 2003.

[40] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.

[41] Zoltán Szabó, Bharath K Sriperumbudur, Barnabás Póczos, and Arthur Gretton. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272–5311, 2016.

[42] Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems, pages 2845–2854, 2017.

[43] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.

[44] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.

[45] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[46] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), pages 40–48. JMLR.org, 2016.

[47] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.

[48] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions.
In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.

[49] Xiaojin Zhu, John D Lafferty, and Zoubin Ghahramani. Semi-supervised learning: From Gaussian fields to Gaussian processes. Technical report, Carnegie Mellon University, Computer Science Department, 2003.