{"title": "A Flexible Generative Framework for Graph-based Semi-supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3281, "page_last": 3290, "abstract": "We consider a family of problems that are concerned about making predictions for the majority of unlabeled, graph-structured data samples based on a small proportion of labeled samples. Relational information among the data samples, often encoded in the graph/network structure, is shown to be helpful for these semi-supervised learning tasks. However, conventional graph-based regularization methods and recent graph neural networks do not fully leverage the interrelations between the features, the graph, and the labels. In this work, we propose a flexible generative framework for graph-based semi-supervised learning, which approaches the joint distribution of the node features, labels, and the graph structure. Borrowing insights from random graph models in network science literature, this joint distribution can be instantiated using various distribution families. For the inference of missing labels, we exploit recent advances of scalable variational inference techniques to approximate the Bayesian posterior. We conduct thorough experiments on benchmark datasets for graph-based semi-supervised learning. Results show that the proposed methods outperform state-of-the-art models under most settings.", "full_text": "A Flexible Generative Framework for Graph-based\n\nSemi-supervised Learning\n\nJiaqi Ma\u2217\u2020\n\njiaqima@umich.edu\n\nWeijing Tang\u2217\u2021\n\nweijtang@umich.edu\n\nJi Zhu\u2021\n\njizhu@umich.edu\n\nQiaozhu Mei\u2020\u00a7\nqmei@umich.edu\n\nAbstract\n\nWe consider a family of problems that are concerned about making predictions\nfor the majority of unlabeled, graph-structured data samples based on a small\nproportion of labeled samples. 
Relational information among the data samples, often encoded in the graph/network structure, is shown to be helpful for these semi-supervised learning tasks. However, conventional graph-based regularization methods and recent graph neural networks do not fully leverage the interrelations between the features, the graph, and the labels. In this work, we propose a flexible generative framework for graph-based semi-supervised learning, which models the joint distribution of the node features, labels, and the graph structure. Borrowing insights from random graph models in the network science literature, this joint distribution can be instantiated with various distribution families. To infer the missing labels, we exploit recent advances in scalable variational inference techniques to approximate the Bayesian posterior. We conduct thorough experiments on benchmark datasets for graph-based semi-supervised learning. Results show that the proposed methods outperform the state-of-the-art models in most settings.

1 Introduction

Traditional machine learning methods typically treat data samples as independent and approximate a mapping function from the features to the outcome of each individual sample. However, many real-world data, such as social media posts or scientific articles, often come with richer relational information among the individual samples. We consider a family of such scenarios where the relational information is stored in a graph structure with the data samples as nodes, and the learning task is to predict the outcomes of unlabeled nodes based on the node features, the graph structure, and the labels of a subset of nodes.
In these scenarios, breaking the independence assumption and utilizing such relational information in the prediction models has been shown to be helpful [26, 25, 1, 10, 5, 14, 12]. However, there is no principled way to best synergize the relational information stored in the graph with the information stored in individual nodes. In this paper, we consider the problem of graph-based semi-supervised learning and approach it by presenting a flexible generative framework that explicitly models the joint relationship among the three key types of information in this context: features, outcomes (or labels), and the graph.

There are two major classes of existing methods for graph-based semi-supervised learning. The first class includes the graph-based regularization methods [26, 25, 1, 14, 12], where explicit regularizations are imposed to smooth the predictions or feature representations over local neighborhoods. This class of methods shares the assumption that some kind of smoothness (e.g., the outcomes of adjacent nodes are likely to be the same) should be present in the local and global graph structure. The

* The two authors contributed equally to this paper.
† School of Information, University of Michigan
‡ Department of Statistics, University of Michigan
§ Department of EECS, University of Michigan

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

second class consists of graph neural networks [10, 5, 21], where the node features within a local neighborhood are aggregated into a hidden representation for the ego node, and predictions are made on top of the hidden representations.
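To make the aggregation idea concrete, a single symmetric-normalized aggregation step can be sketched in a few lines of pure Python. This is a simplified illustration only: it omits the learnable weight matrix, self-loops, and batching that real graph neural network implementations include.

```python
import math

def gcn_aggregate(h, adj):
    """One mean-style neighborhood aggregation step:
    h_i' = relu(sum over neighbors j of h_j / sqrt(d_i * d_j)).
    h:   dict node -> feature vector (list of floats)
    adj: dict node -> list of neighbor nodes
    The learnable linear transform is omitted for brevity."""
    deg = {i: max(len(adj[i]), 1) for i in h}  # guard isolated nodes
    out = {}
    for i in h:
        agg = [0.0] * len(h[i])
        for j in adj[i]:
            c = 1.0 / math.sqrt(deg[i] * deg[j])  # symmetric normalization
            for k, v in enumerate(h[j]):
                agg[k] += c * v
        out[i] = [max(0.0, v) for v in agg]  # element-wise ReLU
    return out
```

Stacking several such steps, each followed by a learned linear transform, yields the deeper networks described in Section 2.2.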
These existing methods either do not treat the graph as a random variable (but rather as a fixed observation) or do not jointly model the data features, graph, and outcomes.

While not yet well explored in graph-based semi-supervised learning, we believe that modeling the joint distribution of the data, graph, and labels with generative models has several unique advantages over the above methods.

First, generative models can learn succinct underlying structures of the graph data. A rich literature in network science [16] has shown that underlying structures often exist in real-world graph data, and many probabilistic generative models [7, 6] can learn these underlying structures well from observed graph data. Most of the existing graph-based semi-supervised learning methods described above view the graph as a fixed observation and treat it as ground truth. In reality, however, an observed graph is often noisy. We expect that, by treating features, outcomes, and the graph as random variables, a generative model can capture more general patterns among these entities and learn low-dimensional representations of the data that account for the noise in the graph.

Second, modeling the joint distribution can extract more general relationships among the features, outcomes, and the graph. We argue that both classes of existing graph-based semi-supervised learning methods utilize only restricted relationships among them. The graph-based regularization methods usually make strong smoothness assumptions over adjacent nodes. Such assumptions often restrict the model capacity, preventing the models from fully utilizing the relational information. The graph neural networks, although more flexible in aggregating node features through the graph structure, usually implicitly assume that the outcomes are conditionally independent given the node features and the graph.
This might be sub-optimal in utilizing the relational information. Directly modeling the joint distribution with flexible models allows us to better utilize the relational information.

Moreover, generative models can better handle missing data. In real-world applications, we are often faced with imperfect data, where either node features or edges in the graph are missing. Generative models excel in such situations.

A few previous studies [24] that tried to apply generative models to graph-based semi-supervised learning were restricted to relatively simple model families due to the difficulty of efficiently training generative models. Thanks to recent advances in scalable variational inference techniques [9, 8], we are able to propose a flexible generative framework for graph-based semi-supervised learning. In this work, we use neural networks, latent space models [6], and stochastic block models [7] to form the generative models, and we use graph neural networks as the approximate posterior models in the scalable variational inference. We refer to such instantiations of the proposed framework as G3NN (Generative Graph models with Graph Neural Networks as approximate posteriors). We evaluate the proposed framework with four variants of G3NN on three semi-supervised classification benchmark datasets.
Experiments show that our models achieve better performance than the state-of-the-art models under most settings.

2 Related Work

This paper focuses on the problem of graph-based semi-supervised learning, where the data samples are connected by a graph and the outcome labels are available for only part of the samples. The goal is to infer the unobserved labels based on both the labeled and unlabeled data as well as the graph structure.

2.1 Graph-based Regularization for Semi-supervised Learning

One of the most popular types of graph-based semi-supervised learning methods is the graph-based regularization methods. The general assumption of such methods is that the data samples lie on a low-dimensional manifold embedded in a high-dimensional Euclidean space, and the graph stores the similarity or proximity of these data samples. Various graph regularizations are imposed to smooth the outcome predictions of the model or the feature representations of the data samples over local neighborhoods in the graph. Suppose there are $n$ data samples in total and $m$ of them are labeled. The graph-based regularization methods generally conduct semi-supervised learning by optimizing the following objective function:

$$\sum_{i=1}^{m} L_i + \eta \sum_{i,j=1}^{n} w_{i,j} R(f_i, f_j),$$

where $L_i$ is the supervised loss of sample $i$; $R(\cdot,\cdot)$ is a regularization function and $w_{i,j}$ is a graph-based coefficient; $f_i, f_j$ can be the outcome predictions [26, 25, 1] or the feature representations [14, 22, 12] of nodes $i$ and $j$; and $\eta$ is a hyper-parameter trading off the supervised loss and the graph-based regularization. Different methods use different variants of the regularization term. Most commonly, it is set as a graph Laplacian regularizer [26, 25, 1, 14, 22].
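As a minimal illustration of this objective, consider scalar predictions, hypothetical edge weights, and the common Laplacian choice $R(f_i, f_j) = (f_i - f_j)^2$:

```python
def regularized_loss(sup_losses, preds, edges, eta):
    """Objective of Sec. 2.1: sum_i L_i + eta * sum_{(i,j)} w_ij * R(f_i, f_j),
    with the Laplacian-style penalty R(f_i, f_j) = (f_i - f_j)**2.
    sup_losses: per-labeled-sample losses L_i
    preds:      scalar prediction f_i per node
    edges:      list of weighted pairs (i, j, w_ij)"""
    reg = sum(w * (preds[i] - preds[j]) ** 2 for i, j, w in edges)
    return sum(sup_losses) + eta * reg
```

The penalty term is small exactly when predictions vary smoothly over edges with large weight, which is the smoothness assumption discussed above.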
Such models rely heavily on the smoothness assumption over the graph, which restricts the modeling capacity [10].

2.2 Graph Neural Networks for Semi-supervised Learning

Another class of methods that has gained great attention recently is the graph neural networks [10, 5, 21]. A graph neural network aggregates the node features within a local neighborhood into a hidden representation for the central node. Such aggregation operations can also be stacked on top of the hidden representations to form deeper neural networks. Generally, a single aggregation operation for node $i$ at depth $l$ can be written as

$$h_i^l = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{i,j} W h_j^{l-1}\Big),$$

where $h_i^l$ is the hidden representation of node $i$ at the $l$-th layer; $\mathcal{N}_i$ is the neighbor set of $i$; $W$ is a learnable linear transformation matrix; $\sigma$ is an element-wise nonlinear activation function; and different models define $\alpha_{i,j}$ differently. For Graph Convolutional Networks [10], $\alpha_{i,j} = 1/d_i$ or $\alpha_{i,j} = 1/\sqrt{d_i d_j}$, where $d_i$ is the number of neighbors of $i$. For Graph Attention Networks [21], $\alpha_{i,j}$ is defined by an attention function between $i$ and $j$. Finally, the prediction for each node is made on top of its hidden representation in the last layer. Such methods usually model the mapping from the features and the graph to the outcome of an individual node, which assumes the outcomes are conditionally independent given the features and the graph. This assumption prevents the model from utilizing the joint relationship among the outcomes over the graph. Our framework models the joint distribution of the features, outcomes, and the graph, and is not restricted by this assumption. A concurrent work [18] also tries to mitigate this assumption.
They take a statistical relational learning point of view and model the outcome dependency with a Markov network conditioned on the graph, while we take a generative-model point of view and instantiate the joint distribution with random graph models.

2.3 Generative Methods for Graph-based Semi-supervised Learning

Most methods from the above two classes treat the graph as a fixed observation, and only a few methods [17, 24, 13] treat the graph as a random variable and model it with generative models. Among them, Ng et al. [17] focused more on an active learning setting on graphs; Zhang et al. [24] modeled the graph with a stochastic block model and did not consider the interaction between the graph and the features or the labels in the generative model of the graph; Liu [13] shares the most similar generative model with our framework, but considers a supervised learning setting where the labels are fully observed. To the best of our knowledge, our work is the first to propose a generative framework for graph-based semi-supervised learning that models the joint distribution of features, outcomes, and the graph with flexible nonlinear models.

Finally, as a side note, there is a recently active area of deep generative models for graphs [11, 3]. These models, however, focus more on generating realistic graph topology and are less related to graph-based semi-supervised learning, the problem of interest in this work.

3 Approach

3.1 Problem Setup

We start by formally introducing the problem of graph-based semi-supervised learning. Given a set of data samples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}^l$ are the feature and outcome vectors of sample $i$, respectively. We further denote by $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{n \times l}$ the matrices formed by the feature and outcome vectors. The dataset also comes with a graph $G = (\mathcal{V}, \mathcal{E})$ with the data samples as nodes, where $\mathcal{V} = \{1, 2, \cdots, n\}$ is the set of nodes and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges. In the semi-supervised learning setting, only $0 < m < n$ samples have observed outcome labels, and the outcome labels of the other samples are missing. Without loss of generality, we assume the outcomes of samples $1, 2, \cdots, m$ are observed and those of $m+1, \cdots, n$ are missing. We can therefore partition the outcome matrix as

$$Y = \begin{bmatrix} Y_{\text{obs}} \\ Y_{\text{miss}} \end{bmatrix}.$$

The goal of graph-based semi-supervised learning is to infer $Y_{\text{miss}}$ based on $(X, Y_{\text{obs}}, G)$. Discriminative methods learn the conditional distribution $p(Y|X, G)$. This is usually done by learning a prediction model $y = f(x; X, G)$ using empirical risk minimization, optionally with regularizations:

$$\hat{f} = \arg\min_{f} \frac{1}{m} \sum_{i=1}^{m} L(y_i, f(x_i; X, G)) + \lambda R(f; G),$$

where $L(\cdot,\cdot)$ is a loss function, $R(\cdot; G)$ is a graph-based regularization term, and $\lambda$ is a hyper-parameter controlling the strength of the regularization. $\hat{f}$ is then used to predict $Y_{\text{miss}}$.

There are two specific learning settings, namely transductive learning and inductive learning, that are common in graph-based semi-supervised learning. Transductive learning assumes that $X$ and $G$ are fully observed during both the learning and inference stages, while inductive learning assumes that $X_{m+1:n}$ and nodes $m+1, \cdots, n$ of $G$ are missing during the learning stage but available during the inference stage. In the rest of this paper we mainly focus on the transductive learning setting, but our method can also be extended to the inductive setting.

3.2 A Flexible Generative Framework for Graph-based Semi-supervised Learning

In discriminative methods, the graph $G$ is usually viewed as a fixed observation. In reality, however, there is usually considerable noise in the graph.
Moreover, we want to take advantage of the underlying structure among $X$, $Y$, and $G$ to improve prediction performance in this semi-supervised learning setting. In this work, we propose a flexible generative framework that can model a wide range of forms of the joint distribution $p(\mathbf{X}, \mathbf{Y}, \mathbf{G})$, where $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{G}$ are the random variables corresponding to $X$, $Y$, and $G$. We similarly denote by $\mathbf{Y}_{\text{obs}}$ and $\mathbf{Y}_{\text{miss}}$ the random-variable counterparts of $Y_{\text{obs}}$ and $Y_{\text{miss}}$.

Generation process. Inspired by random graph models from network science [16], we assume the graph is generated based on the node features and outcomes. The generation process is given by the following factorization of the joint distribution:

$$p(\mathbf{X}, \mathbf{Y}, \mathbf{G}) = p(\mathbf{G}|\mathbf{X}, \mathbf{Y})\, p(\mathbf{Y}|\mathbf{X})\, p(\mathbf{X}),$$

where the conditional probabilities $p(\mathbf{G}|\mathbf{X}, \mathbf{Y})$ and $p(\mathbf{Y}|\mathbf{X})$ are modeled by flexible parametric families of distributions $p_\theta(\mathbf{G}|\mathbf{X}, \mathbf{Y})$ and $p_\theta(\mathbf{Y}|\mathbf{X})$ with parameters $\theta$. By "flexible" we mean that the only restriction on the probability mass functions of these conditional distributions is that they need to be differentiable almost everywhere w.r.t. $\theta$; we do not assume that the marginal distribution $p_\theta(\mathbf{G}|\mathbf{X}) = \int p_\theta(\mathbf{Y}|\mathbf{X})\, p_\theta(\mathbf{G}|\mathbf{Y}, \mathbf{X})\, d\mathbf{Y}$ is tractable. For simplicity, we do not specify the distribution $p(\mathbf{X})$, and everything is conditioned on $X$ in the rest of this paper.

Model inference. To infer the missing outcomes $Y_{\text{miss}}$, we need the posterior distribution $p_\theta(\mathbf{Y}_{\text{miss}}|X, Y_{\text{obs}}, G)$, which is intractable under many flexible generative models. Following recent advances in scalable variational inference [9, 8], we introduce a recognition model $q_\phi(\mathbf{Y}_{\text{miss}}|X, Y_{\text{obs}}, G)$, parameterized by $\phi$, to approximate the true posterior $p_\theta(\mathbf{Y}_{\text{miss}}|X, Y_{\text{obs}}, G)$.

Model learning. We train the model parameters $\theta$ and $\phi$ by optimizing the Evidence Lower BOund (ELBO) of the observed data $(Y_{\text{obs}}, G)$ conditioned on $X$.
The negative ELBO loss $\mathcal{L}_{\text{ELBO}}$ is defined through

$$\log p(Y_{\text{obs}}, G|X) \ge \mathbb{E}_{q_\phi(\mathbf{Y}_{\text{miss}}|X, Y_{\text{obs}}, G)}\big[\log p_\theta(\mathbf{Y}_{\text{miss}}, Y_{\text{obs}}, G|X) - \log q_\phi(\mathbf{Y}_{\text{miss}}|X, Y_{\text{obs}}, G)\big] \triangleq -\mathcal{L}_{\text{ELBO}}(\theta, \phi; X, Y_{\text{obs}}, G),$$

and the optimal model parameters are obtained by minimizing this loss:

$$\hat{\theta}, \hat{\phi} = \arg\min_{\theta, \phi} \mathcal{L}_{\text{ELBO}}(\theta, \phi; X, Y_{\text{obs}}, G).$$

3.3 G3NN Instantiations

For practical use, it remains to specify the parametric forms of the generative models $p_\theta(\mathbf{G}|\mathbf{X}, \mathbf{Y})$ and $p_\theta(\mathbf{Y}|\mathbf{X})$, and the approximate posterior model $q_\phi(\mathbf{Y}_{\text{miss}}|X, Y_{\text{obs}}, G)$. In this section, we instantiate the generative graph model with two types of random graph models and adopt two types of graph neural networks as the approximate posterior model, which leads to four variants of G3NN. As proof of the effectiveness of our general framework, we intentionally instantiate its components with simple models and leave room for optimizing its performance with more complex instantiations. The proposed generative framework does not restrict the type of outcomes; as proof of concept, we focus on multi-class classification outcomes in the rest of the paper and denote the number of classes by $K$.

3.3.1 Instantiations of the Generative Model

For $p_\theta(\mathbf{Y}|\mathbf{X})$, we simply use a multi-layer perceptron. For $p_\theta(\mathbf{G}|\mathbf{X}, \mathbf{Y})$, we propose two instantiations inspired by generative network models from the network science literature. There are two major classes of generative models for complex networks: the latent space models (LSM) [6] and the stochastic block models (SBM) [7]. We instantiate a simple model from each class as our generative models. A general assumption used by both classes is that the edges are conditionally independent.
For each pair of nodes $(i, j) \in \mathcal{V} \times \mathcal{V}$, let $\mathbf{e}_{i,j}$ be the binary edge random variable between nodes $i$ and $j$: $\mathbf{e}_{i,j} = 1$ indicates that the edge between $i$ and $j$ exists, and $0$ otherwise. Under the conditional independence assumption on the edges, the conditional probability of the graph $\mathbf{G}$ factorizes as

$$p_\theta(\mathbf{G}|\mathbf{X}, \mathbf{Y}) = \prod_{i,j} p_\theta(\mathbf{e}_{i,j}|\mathbf{X}, \mathbf{Y}).$$

Next we specify the instantiations of $p_\theta(\mathbf{e}_{i,j}|\mathbf{X}, \mathbf{Y})$.

Instantiation with an LSM. The latent space model assumes that the nodes lie in a latent space and that the probability of $\mathbf{e}_{i,j}$ depends only on the representations of nodes $i$ and $j$, i.e., $p_\theta(\mathbf{e}_{i,j}|\mathbf{X}, \mathbf{Y}) = p_\theta(\mathbf{e}_{i,j}|x_i, y_i, x_j, y_j)$. We assume it follows a logistic regression model:

$$p_\theta(\mathbf{e}_{i,j} = 1|x_i, y_i, x_j, y_j) = \sigma\big([(U x_i)^T, y_i^T, (U x_j)^T, y_j^T]\, w\big),$$

where $\sigma(\cdot)$ is the sigmoid function; $w$ contains the learnable parameters of the logistic regression model; $U$ is a linear transformation matrix with learnable parameters (e.g., a word embedding when $x$ is a bag-of-words feature); the class labels $y_i, y_j$ are represented as one-hot vectors; and we concatenate the transformed features and the class labels of a pair of nodes as the input of the logistic regression model. All the learnable parameters are included in $\theta$.

Instantiation with an SBM. The stochastic block model assumes there are $C$ types of nodes, each node $i$ has a (latent) type variable $z_i \in \{1, 2, \cdots, C\}$, and the probability of edge $\mathbf{e}_{i,j}$ depends only on the node types $z_i$ and $z_j$. In a general SBM, $p_\theta(\mathbf{e}_{i,j} = 1|z_i, z_j) = p_{z_i, z_j}$, and the $p_{u,v}$ for all $u, v \in \{1, 2, \cdots, C\}$ form a probability matrix $P$ of free parameters to be fitted. In our model, we take the node types to be the class labels, i.e., $C = K$ with $z_i$ the class corresponding to $y_i$, so that $p_\theta(\mathbf{e}_{i,j}|\mathbf{X}, \mathbf{Y}) = p_\theta(\mathbf{e}_{i,j}|y_i, y_j)$.
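The LSM edge probability above can be sketched in a few lines of pure Python. The matrix `U` and weight vector `w` in the example below are toy stand-ins for the learned parameters, not values from the paper:

```python
import math

def lsm_edge_prob(xi, yi, xj, yj, U, w):
    """LSM edge model: sigma([U xi ; yi ; U xj ; yj] . w).
    xi, xj: feature vectors; yi, yj: one-hot label vectors;
    U: linear transformation (list of rows); w: logistic weights."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    z = matvec(U, xi) + yi + matvec(U, xj) + yj  # concatenation
    s = sum(a * b for a, b in zip(z, w))         # dot product with w
    return 1.0 / (1.0 + math.exp(-s))            # sigmoid
```

With all-zero weights the model is maximally uncertain and returns probability 0.5 for every pair, regardless of features or labels.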
Note that in our notation $y_i$ is a one-hot vector. We specify $p_\theta(\mathbf{e}_{i,j}|y_i, y_j)$ with the simplest SBM, also called the planted partition model:

$$\mathbf{e}_{i,j}\,|\,y_i, y_j \sim \begin{cases} \text{Bernoulli}(p_0) & \text{if } y_i = y_j, \\ \text{Bernoulli}(p_1) & \text{if } y_i \ne y_j. \end{cases}$$

That is, the probability matrix $P$ has all diagonal elements equal to a constant $p_0$ and all off-diagonal elements equal to another constant $p_1$.

3.3.2 Instantiations of the Approximate Posterior Model

For the approximate posterior model $q_\phi(\mathbf{Y}_{\text{miss}}|X, Y_{\text{obs}}, G)$, we in principle need a strong function approximator that takes $(X, Y_{\text{obs}}, G)$ as input and outputs the probability of $Y_{\text{miss}}$. Here we consider two recently introduced graph neural networks: the Graph Convolutional Network (GCN) [10] and the Graph Attention Network (GAT) [21]. Note that in doing so we make a further approximation from $q_\phi(\mathbf{Y}_{\text{miss}}|X, Y_{\text{obs}}, G)$ to $q_\phi(\mathbf{Y}_{\text{miss}}|X, G)$, as these graph neural networks by design take only $(X, G)$ as input. This approximation is known as the mean-field method, which is commonly used in variational inference [2].

3.4 Training

Finally, we close this section with two practical details of model training.

Supervised loss. As our main task is to conduct classification with the approximate posterior model, similarly to Kingma et al.
[8], we add an additional supervised loss to better train the approximate posterior model:

$$\mathcal{L}_s(\phi; X, Y_{\text{obs}}, G) = -\log q_\phi(Y_{\text{obs}}|X, G).$$

The total loss is controlled by a weight hyper-parameter $\eta$:

$$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\text{ELBO}}(\theta, \phi; X, Y_{\text{obs}}, G) + \eta \cdot \mathcal{L}_s(\phi; X, Y_{\text{obs}}, G).$$

We can rewrite the total loss in an alternative form,

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(\mathbf{Y}_{\text{miss}}|X, G)}\big[-\log p_\theta(G|X, Y_{\text{obs}}, \mathbf{Y}_{\text{miss}})\big] + D_{\text{KL}}\big(q_\phi(\mathbf{Y}_{\text{miss}}|X, G)\,\big\|\,p_\theta(\mathbf{Y}_{\text{miss}}|X)\big) - \log p_\theta(Y_{\text{obs}}|X) - \eta \cdot \log q_\phi(Y_{\text{obs}}|X, G),$$

which provides a connection between the proposed generative framework and existing graph neural networks. The fourth term, $-\eta \cdot \log q_\phi(Y_{\text{obs}}|X, G)$, provides supervised information from the labeled data for the approximate posterior GCN or GAT, while the other three terms can be viewed as additional regularizations: the learned GCN or GAT is encouraged to support the generative model of the graph and not to stray far from $p_\theta(\mathbf{Y}|\mathbf{X})$.

Negative edge sampling. For both the LSM- and SBM-based models, the probability $p_\theta(\mathbf{G}|\mathbf{X}, \mathbf{Y})$ factorizes into the product of the probabilities of all possible edges. Calculating its log-likelihood requires enumerating all $(i, j)$ pairs with $i, j \in \{1, 2, \cdots, n\}$, which incurs an $O(n^2)$ computational cost at each epoch. In practice, instead of going through all $(i, j)$ pairs, we only calculate the probabilities of the edges observed in the graph and of a set of "negative edges" randomly sampled from the $(i, j)$ pairs where no edge exists. This practical trick, called negative sampling, is commonly used in the training of word embeddings [15] and graph embeddings [20].

4 Experiments

In this section, we evaluate the proposed variants of G3NN on several benchmark datasets for graph-based semi-supervised learning.
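Before turning to the results, the negative edge sampling trick described in Section 3.4 can be sketched as follows. This is a simplified uniform sampler, not the paper's implementation, and it assumes the number of requested negatives is small relative to the number of non-edges:

```python
import random

def sample_negative_edges(n, edges, k, seed=0):
    """Sample k distinct (i, j) pairs that are NOT observed edges, so the
    graph likelihood is evaluated on the observed edges plus this sample
    instead of all O(n^2) pairs.
    n:     number of nodes
    edges: list of observed (i, j) pairs (treated as undirected)
    k:     number of negative edges to draw"""
    observed = set(edges) | {(j, i) for i, j in edges}
    observed |= {(i, i) for i in range(n)}  # exclude self-loops
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < k:
        pair = (rng.randrange(n), rng.randrange(n))
        if pair not in observed and pair not in negatives:
            negatives.add(pair)
    return sorted(negatives)
```

In Section 4 the number of negatives $k$ is set equal to the number of observed edges.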
We test the models under both the standard benchmark setting [23] and two data-scarce settings.

4.1 Standard Benchmark Setting

We first consider a standard benchmark setting from the recent graph-based semi-supervised learning literature [23, 10, 21].

Datasets. We use three standard semi-supervised learning benchmark datasets for graph neural networks: Citeseer, Cora, and Pubmed [19, 23]. The graph G of each dataset is a citation network with documents as nodes and citations as edges. The feature vector of each node is a bag-of-words representation of the document, and the class label represents the research area the document belongs to. We adopt these datasets from the PyTorch-Geometric library [4] in our experiments.^5 For each dataset, we summarize the number of classes, the number of nodes, the number of edges, and the average number of nodes within the 2-hop neighborhood of each node in Table 1. In this standard benchmark setting, we closely follow the dataset setup of Yang et al. [23] and Kipf and Welling [10].

^5 The datasets loaded by the PyTorch-Geometric data loader have slightly fewer edges than those reported in Yang et al. [23], which is believed to be due to duplicate edges in the original datasets.

Table 1: Summary of benchmark datasets.

Dataset  | # Classes | # Nodes | # Edges | Avg. 2-Neighborhood Size
Cora     |     7     |  2,708  |  5,278  | 35.8
Pubmed   |     3     | 19,717  | 44,324  | 59.1
Citeseer |     6     |  3,327  |  4,552  | 14.1

Table 2: Classification accuracy under the standard benchmark setting. The upper block lists the discriminative baselines; the lower block lists the proposed variants of G3NN. Bold marks the best performance on each dataset. An underline marks a generative model outperforming its discriminative counterpart (e.g., LSM-GCN outperforming GCN), and an asterisk (*) marks a difference that is statistically significant by a t-test at significance level 0.05. The (±) error bar denotes the standard deviation of the test performance over 10 independent trials.

         | Cora           | Pubmed         | Citeseer
MLP      | 0.583 ± 0.009  | 0.734 ± 0.002  | 0.569 ± 0.008
GCN      | 0.815 ± 0.002  | 0.794 ± 0.004  | 0.718 ± 0.003
GAT      | 0.825 ± 0.005  | 0.785 ± 0.004  | 0.715 ± 0.007
LSM-GCN  | 0.825 ± 0.002* | 0.779 ± 0.004  | 0.744 ± 0.003*
LSM-GAT  | 0.829 ± 0.003  | 0.776 ± 0.007  | 0.731 ± 0.005*
SBM-GCN  | 0.822 ± 0.002* | 0.784 ± 0.006  | 0.745 ± 0.004*
SBM-GAT  | 0.829 ± 0.003  | 0.774 ± 0.004  | 0.740 ± 0.003*

Models for comparison. For the proposed framework, we implement four (2 × 2) variants of G3NN by combining the two generative model instantiations with the two approximate posterior model instantiations: LSM-GCN, SBM-GCN, LSM-GAT, and SBM-GAT. For baselines, we compare against two state-of-the-art models for graph-based semi-supervised learning, GCN [10] and GAT [21]. We also include a multi-layer perceptron (MLP), a fully connected neural network that uses no graph information, as a reference.

We use the original architectures of the GCN and GAT models in both the baselines and the proposed methods. We grid search the number of hidden units over (16, 32, 64) and the learning rate over (0.001, 0.005, 0.01). GAT uses a multi-head attention mechanism; in our experiments, we fix the number of heads at 8 and try setting either the total number of hidden units or the number of hidden units of a single head to (16, 32, 64).
In the proposed methods, we set the generative model for $p_\theta(\mathbf{Y}|\mathbf{X})$ to be a two-layer MLP with the same number of hidden units as the corresponding GCN or GAT in the posterior model. For the MLP baseline, we also set the number of layers to 2 and grid search the number of hidden units and the learning rate as for the other models. For the proposed generative models, we grid search the coefficient of the supervised loss $\eta$ over (0.5, 1, 10). The number of negative edges is set to the number of observed edges in the graph. For the LSM models, the dimension of the feature transformation matrix $U$ is fixed to $8 \times d$, where $d$ is the feature size. For the SBM models, we use two settings of $(p_0, p_1)$: (0.9, 0.1) and (0.5, 0.6). We use the Adam optimizer to train all models and apply early stopping based on the cross-entropy loss on the validation set. We adopt the implementations of GCN and GAT from the PyTorch-Geometric library [4] in all our experiments.

Results. The performance of the baselines and the proposed models under the standard benchmark setting is summarized in Table 2. We report the mean and standard deviation of the test accuracy over 10 independent trials for each model. The results show that on all datasets except Pubmed, the proposed methods achieve the best test accuracy under the standard benchmark setting. Notably, every instantiation of the proposed generative framework outperforms its corresponding discriminative baseline (GCN or GAT) in most cases. We also note that GCN performs better than GAT and the proposed models on Pubmed. We conjecture that, when the number of classes is small and the graph is relatively dense, GCN may already be quite capable of propagating feature information from neighbors (see the average size of 2-hop neighborhoods in Table 1). When there are more classes or the graph is relatively sparse (e.g., Cora, Citeseer, and the missing-edge setting of Pubmed in Section 4.2.1), the advantage of our proposed method is more evident.

Table 3: Classification accuracy under the missing-edge setting. The bold marker, the underline marker, the asterisk (*) marker, and the (±) error bar share the same definitions as in Table 2.

         | Cora           | Pubmed         | Citeseer
MLP      | 0.583 ± 0.009  | 0.734 ± 0.002  | 0.569 ± 0.008
GCN      | 0.665 ± 0.007  | 0.746 ± 0.004  | 0.652 ± 0.005
GAT      | 0.682 ± 0.004  | 0.744 ± 0.006  | 0.642 ± 0.004
LSM-GCN  | 0.711 ± 0.005* | 0.766 ± 0.006* | 0.704 ± 0.002*
LSM-GAT  | 0.710 ± 0.007* | 0.766 ± 0.004* | 0.691 ± 0.005*
SBM-GCN  | 0.718 ± 0.004* | 0.762 ± 0.005* | 0.716 ± 0.004*
SBM-GAT  | 0.716 ± 0.007* | 0.761 ± 0.005* | 0.709 ± 0.008*

4.2 Data-Scarce Settings

Generative models usually have better sample efficiency than discriminative models. We therefore expect the proposed generative framework to show a bigger advantage when data are scarce. Next, we evaluate the models on the citation datasets under two such settings: a missing-edge setting and a reduced-label setting.

4.2.1 Missing-Edge Setting

In the standard benchmark setting, we assume that all samples are connected to the graph identically, and the training, validation, and test sets are split randomly. In practice, however, the samples of interest at test time may not be well connected to the graph. For example, we may have no connections for new users in a social network other than their profile information.
In this cold-start situation, one might expect to make predictions purely based on the profile information. However, we believe that the relational information stored in the graph of the training data can still help us learn a better and more generalizable model, even if some of the predictions are made based only on the node features. We also expect the proposed generative models to work better than the discriminative baselines in this case, because they can better distill the relationships among the data. To mimic such situations, we create a missing-edge setting, where we remove all edges incident to the test nodes from the graph. Note that this setting is different from the inductive learning setting in previous works [5, 21], where the edges of the test data are absent during the training stage but present during the test stage. In the missing-edge setting, the edges of the test data are absent during both stages. We follow the same experimental setup as in the standard benchmark setting except for this modification of the graph.

Results. The performance under the missing-edge setting is shown in Table 3. Not surprisingly, as we lose part of the graph information, the performance of all models except MLP (which does not use the graph at all) drops compared to the standard benchmark setting in Table 2. However, the proposed generative models outperform their corresponding discriminative baselines by a large margin. Remarkably, even without knowing any edges of the out-of-sample nodes, SBM_GCN on Citeseer matches the state-of-the-art accuracy of GCN under the standard benchmark setting.

4.2.2 Reduced-Label Setting

Another common data-scarce situation is the lack of labeled data. Therefore, we create a reduced-label setting, where we drop half of the training labels for each class compared to the standard benchmark setting.
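To make the construction of the two data-scarce settings concrete, the following is a minimal sketch. The (2, E) edge-index layout follows the PyTorch-Geometric convention, but the function names and array handling here are ours, not from the paper's code:

```python
import numpy as np

def remove_test_edges(edge_index, test_nodes):
    """Missing-edge setting: drop every edge incident to a test node.

    edge_index: (2, E) integer array of [source; target] node pairs.
    test_nodes: set of node ids held out as the test set.
    """
    keep = [j for j in range(edge_index.shape[1])
            if edge_index[0, j] not in test_nodes
            and edge_index[1, j] not in test_nodes]
    return edge_index[:, keep]

def halve_train_labels(train_idx, labels, seed=0):
    """Reduced-label setting: keep half of the labeled training nodes
    within each class, sampled uniformly at random."""
    rng = np.random.default_rng(seed)
    kept = []
    for c in np.unique(labels[train_idx]):
        idx_c = train_idx[labels[train_idx] == c]
        kept.extend(rng.choice(idx_c, size=len(idx_c) // 2, replace=False))
    return np.sort(np.asarray(kept))
```

Under this sketch, training and evaluation proceed exactly as before, only with the pruned edge index (missing-edge setting) or the subsampled training indices (reduced-label setting).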
All other experiment and model setups are the same as in the standard benchmark setting.

Results. The performance under the reduced-label setting is shown in Table 4. As can be seen from the results, the proposed generative models again achieve the best test accuracy on Cora and Citeseer. Moreover, the performance gaps between the proposed generative models and the corresponding discriminative models are larger on these two datasets.

Table 4: Classification accuracy under the reduced-label setting. The bold marker, the underline marker, the asterisk (*) marker, and the (±) error bar share the same definitions as in Table 2.

            Cora             Pubmed           Citeseer
MLP         0.498 ± 0.004    0.674 ± 0.005    0.493 ± 0.010
GCN         0.750 ± 0.003    0.724 ± 0.005    0.666 ± 0.003
GAT         0.771 ± 0.004    0.711 ± 0.006    0.675 ± 0.005
LSM_GCN     0.777 ± 0.002*   0.709 ± 0.003    0.691 ± 0.005*
LSM_GAT     0.792 ± 0.004*   0.699 ± 0.003    0.691 ± 0.004*
SBM_GCN     0.780 ± 0.002*   0.710 ± 0.004    0.703 ± 0.006*
SBM_GAT     0.796 ± 0.008*   0.699 ± 0.003    0.698 ± 0.003*

5 Conclusion

In this paper, we have presented a flexible generative framework for graph-based semi-supervised learning. By applying scalable variational inference, this framework is able to combine the advantages of recently developed graph neural networks with the wisdom of random graph models from the classical network science literature, which leads to the G3NN model. We further implement four variants of G3NN, instantiations of the proposed framework in which we build generative graph models with graph neural networks as the approximate posterior models. Through thorough experiments, we demonstrated that these instantiations outperform state-of-the-art graph-based semi-supervised learning methods on most benchmark datasets under the standard benchmark setting.
We also showed that the proposed generative framework has great potential in data-scarce situations. For future work, we expect more complex instantiations of generative models to be developed using this framework to further improve graph-based semi-supervised learning.

Acknowledgments

This work was in part supported by the National Science Foundation under grant numbers 1633370 and 1620319.

References

[1] Belkin, M., Niyogi, P., and Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434.

[2] Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.

[3] De Cao, N. and Kipf, T. (2018). MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.

[4] Fey, M. and Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.

[5] Hamilton, W., Ying, Z., and Leskovec, J. (2017). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034.

[6] Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098.

[7] Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks, 5(2):109–137.

[8] Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.

[9] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[10] Kipf, T. N. and Welling, M. (2016a).
Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

[11] Kipf, T. N. and Welling, M. (2016b). Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

[12] Li, T., Levina, E., Zhu, J., et al. (2019). Prediction models for network-linked data. The Annals of Applied Statistics, 13(1):132–164.

[13] Liu, B. (2019). Statistical learning for networks with node features. PhD thesis, University of Michigan.

[14] Mei, Q., Zhang, D., and Zhai, C. (2008). A general optimization framework for smoothing language models on graph structures. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 611–618. ACM.

[15] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[16] Newman, M. (2010). Networks: An Introduction. Oxford University Press.

[17] Ng, Y. C., Colombo, N., and Silva, R. (2018). Bayesian semi-supervised learning with graph Gaussian processes. In Advances in Neural Information Processing Systems, pages 1683–1694.

[18] Qu, M., Bengio, Y., and Tang, J. (2019). GMNN: Graph Markov neural networks. arXiv preprint arXiv:1905.06214.

[19] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. (2008). Collective classification in network data. AI Magazine, 29(3):93.

[20] Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. (2015). LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee.

[21] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2017). Graph attention networks.
arXiv preprint arXiv:1710.10903.

[22] Weston, J., Ratle, F., Mobahi, H., and Collobert, R. (2012). Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer.

[23] Yang, Z., Cohen, W. W., and Salakhutdinov, R. (2016). Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861.

[24] Zhang, Y., Pal, S., Coates, M., and Üstebay, D. (2018). Bayesian graph convolutional neural networks for semi-supervised classification. arXiv preprint arXiv:1811.11103.

[25] Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). Learning with local and global consistency. In Advances in Neural Information Processing Systems, pages 321–328.

[26] Zhu, X., Ghahramani, Z., and Lafferty, J. D. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919.