Improving Textual Network Learning with Variational Homophilic Embeddings

Advances in Neural Information Processing Systems (NeurIPS 2019), pages 2076-2087

Wenlin Wang1, Chenyang Tao1, Zhe Gan2, Guoyin Wang1, Liqun Chen1, Xinyuan Zhang1, Ruiyi Zhang1, Qian Yang1, Ricardo Henao1, Lawrence Carin1
1Duke University, 2Microsoft Dynamics 365 AI Research
wenlin.wang@duke.edu

Abstract

The performance of many network learning applications crucially hinges on the success of network embedding algorithms, which aim to encode rich network information into low-dimensional vertex-based vector representations.
This paper considers a novel variational formulation of network embeddings, with special focus on textual networks. Different from most existing methods that optimize a discriminative objective, we introduce Variational Homophilic Embedding (VHE), a fully generative model that learns network embeddings by modeling the semantic (textual) information with a variational autoencoder, while accounting for the structural (topology) information through a novel homophilic prior design. Homophilic vertex embeddings encourage similar embedding vectors for related (connected) vertices. The proposed VHE promises better generalization for downstream tasks, robustness to incomplete observations, and the ability to generalize to unseen vertices. Extensive experiments on real-world networks, for multiple tasks, demonstrate that the proposed method consistently achieves superior performance relative to competing state-of-the-art approaches.

1 Introduction

Network learning is challenging since graph structures are not directly amenable to standard machine learning algorithms, which traditionally assume vector-valued inputs [4, 15]. Network embedding techniques solve this issue by mapping a network into vertex-based low-dimensional vector representations, which can then be readily used for various downstream network analysis tasks [10]. Due to their effectiveness and efficiency in representing large-scale networks, network embeddings have become an important tool for understanding network behavior and making predictions [24], thus attracting considerable research attention in recent years [31, 37, 16, 42, 8, 47, 40].

Existing network embedding models can be roughly grouped into two categories. The first consists of models that only leverage the structural information (topology) of a network, e.g., available edges (links) across vertices.
Prominent examples include classic deterministic graph factorizations [6, 1], the probabilistically formulated LINE [37], and diffusion-based models such as DeepWalk [31] and Node2Vec [16]. While widely applicable, these models are often vulnerable to violations of their underlying assumptions, such as dense connections and noise-free, complete (non-missing) observations [30]. They also ignore the rich side information commonly associated with vertices, provided naturally in many real-world networks, e.g., labels, texts, images, etc. For example, in social networks users have profiles, and in citation networks articles have text content (e.g., abstracts). Models from the second category exploit these additional attributes to improve both the informativeness and robustness of network embeddings [49, 36]. More recently, models such as CANE [40] and WANE [34] advocate the use of contextualized network embeddings to increase representation capacity, further enhancing performance in downstream tasks.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Comparison of the generative processes between the standard VAE and the proposed VHE. (a) The standard VAE models a single vertex x_i in terms of latent z_i. (b) VHE models pairs of vertices, by categorizing their connections into: (i) link, (ii) no link, and (iii) unknown. p1 is the (latent) prior for pairs of linked vertices, p0 is the prior for those without a link, and w_ij indicates whether an edge between node i and node j is present. When w_ij = N/A, it is sampled from a Bernoulli distribution parameterized by π_0.

Existing solutions, however, almost exclusively focus on the use of discriminative objectives. Specifically, models are trained to maximize the accuracy in predicting the network topology, i.e., edges. Despite their empirical success, this practice biases embeddings toward link-prediction accuracy, potentially compromising performance for other downstream tasks. Alternatively, generative models [19], which aim to recover the data-generating mechanism and thereby characterize the latent structure of the data, could potentially yield better embeddings [13]. This avenue remains largely unexplored in the context of network representation learning [23]. Among various generative modeling techniques, the variational autoencoder (VAE) [21], which is formulated under a Bayesian paradigm and optimizes a lower bound of the data likelihood, has been established as one of the most popular solutions due to its flexibility, generality and strong performance [7]. The integration of such variational objectives promises to improve the performance of network embeddings.

The standard VAE is formulated to model single data elements, i.e., vertices in a network, thus ignoring their connections (edges); see Figure 1(a). Within the setting of network embeddings, well-known principles underlying network formation [4] may be exploited. One such example is homophily [29], which describes the tendency of edges to form between vertices that share similarities in their accompanying attributes, e.g., profile or text. This behavior has been widely validated in many real-world scenarios, prominently in social networks [29]. For networks with complex attributes, such as text, homophilic similarity can be characterized more appropriately in some latent (semantic) space, rather than in the original data space. The challenge of leveraging homophilic similarity for network embeddings largely remains uncharted, motivating our work, which seeks to develop a novel form of VAE that encodes pairwise homophilic relations.

In order to incorporate homophily into our model design, we propose Variational Homophilic Embedding (VHE), a novel variational treatment for modeling networks in terms of vertex pairs rather than individual vertices; see Figure 1(b).
While our approach is widely applicable to networks with general attributes, in this work we focus the discussion on its application to textual networks, a setting that is both challenging and of practical significance. We highlight our contributions as follows: (i) A scalable variational formulation of network embeddings, accounting for both network topology and vertex attributes, together with model uncertainty estimation. (ii) A homophilic prior that leverages edge information to exploit pairwise similarities between vertices, facilitating the integration of structural and attribute (semantic) information. (iii) A phrase-to-word alignment scheme to model textual embeddings, efficiently capturing local semantic information across words in a phrase. Compared with existing state-of-the-art approaches, the proposed method allows for missing edges and generalizes to unseen vertices at test time. A comprehensive empirical evaluation reveals that our VHE consistently outperforms competing methods on real-world networks, spanning applications from link prediction to vertex classification.

2 Background

Notation and concepts. Let G = {V, E, X} be a network with attributes, where V = {v_i}_{i=1}^N is the set of vertices, E ⊆ V × V denotes the edges, and X = {x_i}_{i=1}^N represents the side information (attributes) associated with each vertex. We consider the case for which X are given in the form of text sequences, i.e., x_i = [x_i^1, ..., x_i^{L_i}], where each x_i^l is a word (or token) from a pre-specified vocabulary. Without loss of generality, we assume the network is undirected, so that the edges E can be compactly represented by a symmetric (nonnegative) matrix W ∈ {0, 1}^{N×N}, where each element w_ij represents the weight for the edge between vertices v_i and v_j.
Here w_ij = 1 indicates the presence of an edge between vertices v_i and v_j.

Variational Autoencoder (VAE). In likelihood-based learning, one seeks to maximize the empirical expectation of the log-likelihood, (1/N) Σ_i log p_θ(x_i), w.r.t. training examples {x_i}_{i=1}^N, where p_θ(x) is the model likelihood parameterized by θ. In many cases, especially when modeling complex data, latent-variable models of the form p_θ(x, z) = p_θ(x|z)p(z) are of interest, with p(z) the prior distribution for latent code z and p_θ(x|z) the conditional likelihood. Typically, the prior comes in the form of a simple distribution, such as an (isotropic) Gaussian, while the complexity of the data is captured by the conditional likelihood p_θ(x|z). Since the marginal likelihood p_θ(x) rarely has a closed-form expression, the VAE seeks to maximize the following evidence lower bound (ELBO), which bounds the marginal log-likelihood from below:

log p_θ(x) ≥ L_{θ,φ}(x) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p(z)),   (1)

where q_φ(z|x) is a (tractable) approximation to the (intractable) posterior p_θ(z|x). Note that the first, conditional-likelihood term can be interpreted as the (negative) reconstruction error, while the second, Kullback-Leibler (KL) divergence term can be viewed as a regularizer. Conceptually, the VAE encodes input data into a (low-dimensional) latent space and then decodes it back to reconstruct the input. Hereafter, we will use the terms encoder and approximate posterior q_φ(z|x) interchangeably, and similarly for the decoder and conditional likelihood p_θ(x|z).

3 Variational Homophilic Embedding

To efficiently encode both the topological (E) and semantic (X) information of network G, we propose a novel variational framework that models the joint likelihood p_θ(x_i, x_j) for pairs of vertices v_i and v_j using a latent-variable model, conditioned on their link profile, i.e., the existence of an edge, via W.
Our model construction is elaborated below, with additional details provided in the Supplementary Material (SM).

A naïve variational solution. To motivate our model, we first consider a simple variational approach and discuss its limitations. A popular strategy in the network embedding literature [10] is to split the embedding vector into two disjoint components: (i) a structural embedding, which accounts for network topology; and (ii) a semantic embedding, which encodes vertex attributes. For the latter, we can simply apply a VAE to learn the semantic embeddings by treating the vertex data {x_i} as independent entities, and then obtain embeddings via the approximate posterior q_φ(z_i|x_i), which is learned by optimizing the lower bound to log p_θ(x_i) in (1) for {x_i}_{i=1}^N.

Such variationally learned semantic embeddings can be concatenated with structural embeddings derived from existing schemes (such as Node2Vec [16]) to compose the final vertex embedding. While this partly alleviates the issues discussed above, a few caveats are readily noticed: (i) the structural embedding still relies on the use of discriminative objectives; (ii) the structural and semantic embeddings are not trained under a unified framework, but separately; and, most importantly, (iii) the structural information is ignored in the construction of semantic embeddings. In the following, we develop a fully generative approach based on the VAE that addresses these limitations.

3.1 Formulation of VHE

Homophilic prior. Inspired by the homophily phenomenon observed in real-world networks [29], we propose to model pairs of vertex attributes with an inductive prior [5], such that for connected vertices, their embeddings will be similar (correlated). Unlike the naïve VAE solution above, we now consider modeling paired instances as p_θ(x_i, x_j|w_ij), conditioned on their link profile w_ij.
In particular, we consider a model of the form

p_θ(x_i, x_j|w_ij) = ∫ p_θ(x_i|z_i) p_θ(x_j|z_j) p(z_i, z_j|w_ij) dz_i dz_j.   (2)

For simplicity, we treat the triplets {x_i, x_j, w_ij} as independent observations. Note that x_i and x_j conform to the same latent space, as they share the same decoding distribution p_θ(x|z). We wish to enforce the homophilic constraint, such that if vertices v_i and v_j are connected, similarities between the latent representations of x_i and x_j should be expected. To this end, we consider a homophilic prior defined as

p(z_i, z_j|w_ij) = p1(z_i, z_j) if w_ij = 1,  and  p0(z_i, z_j) if w_ij = 0,

where p1(z_i, z_j) and p0(z_i, z_j) denote the priors with and without an edge between the vertices, respectively. We want these priors to be intuitive and easy to compute within the ELBO, which leads to the choice of the following forms:

p1(z_i, z_j) = N( [0_d; 0_d], [I_d, γ I_d; γ I_d, I_d] ),   p0(z_i, z_j) = N( [0_d; 0_d], [I_d, 0_d; 0_d, I_d] ),   (3)

where N(·,·) is a multivariate Gaussian, 0_d denotes an all-zero vector or matrix depending on the context, I_d is the identity matrix, and γ ∈ [0, 1) is a hyper-parameter controlling the strength of the expected similarity (in terms of correlation). Note that p0 is a special case of p1 when γ = 0, implying the absence of homophily, while p1 accounts for the existence of homophily via γ, the homophily factor. In Section 3.3, we describe how to obtain embeddings for single vertices while addressing the computational challenges of doing so on large networks, where evaluating all pairwise components is prohibitive.

Posterior approximation. Now we consider the choice of approximate posterior for the paired latent variables {z_i, z_j}. Note that with the homophilic prior p1(z_i, z_j), the use of an approximate posterior that does not account for the correlation between the latent codes is inappropriate.
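To make the role of the homophily factor γ in (3) concrete, the following minimal numpy sketch (an illustration only, not the authors' code; the function name is ours) builds the covariance of p1, checks that γ = 0 recovers p0, and verifies empirically that matched coordinates of z_i and z_j correlate at strength γ:

```python
import numpy as np

def homophilic_prior_cov(d, gamma):
    # Covariance of p1 in (3): identity blocks on the diagonal and
    # gamma * I_d off-diagonal blocks coupling z_i and z_j.
    I = np.eye(d)
    return np.block([[I, gamma * I], [gamma * I, I]])

d, gamma = 4, 0.7
Sigma1 = homophilic_prior_cov(d, gamma)   # prior for linked pairs (w_ij = 1)
Sigma0 = homophilic_prior_cov(d, 0.0)     # gamma = 0 recovers p0 (w_ij = 0)

rng = np.random.default_rng(0)
z = rng.multivariate_normal(np.zeros(2 * d), Sigma1, size=50000)
# Empirical correlation between matched coordinates of z_i and z_j:
corr = np.corrcoef(z[:, 0], z[:, d])[0, 1]   # close to gamma
```

As γ approaches 1, the two latent codes become maximally correlated, which is how the prior encodes homophily between connected vertices.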
Therefore, we consider the following multivariate Gaussians to approximate the posterior:

q1(z_i, z_j|x_i, x_j) = N( [μ_i; μ_j], [σ_i², γ_ij σ_i σ_j; γ_ij σ_i σ_j, σ_j²] ),
q0(z_i, z_j|x_i, x_j) = N( [μ̂_i; μ̂_j], [σ̂_i², 0_d; 0_d, σ̂_j²] ),   (4)

where μ_i, μ_j, μ̂_i, μ̂_j ∈ R^d and σ_i, σ_j, σ̂_i, σ̂_j ∈ R^{d×d} are posterior means and (diagonal) covariances, respectively, and γ_ij ∈ R^{d×d}, also diagonal, is the a posteriori homophily factor. Elements of γ_ij are assumed in [0, 1] to ensure the validity of the covariance matrix. Note that all these variables, denoted collectively in the following as φ, are neural-network-based functions of the paired data triplet {x_i, x_j, w_ij}. We omit their dependence on inputs for notational clarity.

For simplicity, below we take q1(z_i, z_j|x_i, x_j) as an example to illustrate the inference; q0(z_i, z_j|x_i, x_j) is derived similarly. To compute the variational bound, we need to sample from the posterior and back-propagate its parameter gradients. It can be verified that the Cholesky decomposition [11] Σ_ij = L_ij L_ij^T of the covariance matrix of q1(z_i, z_j|x_i, x_j) in (4) takes the form

L_ij = [σ_i, 0_d; γ_ij σ_j, sqrt(1 − γ_ij²) σ_j],   (5)

allowing sampling from the approximate posterior in (4) via

[z_i; z_j] = [μ_i; μ_j] + L_ij ε,  where ε ∼ N(0_{2d}, I_{2d}),   (6)

and [·;·] denotes concatenation. This isolates the stochasticity in the sampling process and enables easy back-propagation of the parameter gradients from the likelihood term log p_θ(x_i, x_j|z_i, z_j) without special treatment.
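As a sanity check on (5)-(6), the sketch below (numpy; an illustration, not the released implementation, with function names of our choosing) forms the lower-triangular factor L_ij for diagonal σ_i, σ_j and elementwise γ_ij, and verifies that L_ij L_ij^T reproduces the covariance of q1 in (4):

```python
import numpy as np

def homophilic_cholesky(sig_i, sig_j, gamma):
    # Lower-triangular factor of (5); sig_i, sig_j are the diagonal std
    # vectors and gamma the (elementwise) a posteriori homophily factors.
    d = len(sig_i)
    Z = np.zeros((d, d))
    return np.block([
        [np.diag(sig_i), Z],
        [np.diag(gamma * sig_j), np.diag(np.sqrt(1.0 - gamma**2) * sig_j)],
    ])

rng = np.random.default_rng(1)
d = 3
sig_i = rng.uniform(0.5, 1.5, d)
sig_j = rng.uniform(0.5, 1.5, d)
gamma = rng.uniform(0.0, 0.9, d)

L = homophilic_cholesky(sig_i, sig_j, gamma)
Sigma = np.block([  # covariance of q1 in (4)
    [np.diag(sig_i**2), np.diag(gamma * sig_i * sig_j)],
    [np.diag(gamma * sig_i * sig_j), np.diag(sig_j**2)],
])

# Reparameterized sample as in (6): [z_i; z_j] = [mu_i; mu_j] + L @ eps
mu = np.zeros(2 * d)
eps = rng.standard_normal(2 * d)
z = mu + L @ eps
```

Because L_ij is triangular with a positive diagonal, it coincides with the unique Cholesky factor of the covariance, so reparameterized samples carry exactly the posterior covariance.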
Further, after some algebraic manipulations, the KL term between the homophilic posterior and prior can be derived in closed form, omitted here for brevity (see the SM). This gives the ELBO of the VHE with complete observations:

L_{θ,φ}(x_i, x_j|w_ij) = w_ij [ E_{z_i,z_j∼q1}[log p_θ(x_i, x_j|z_i, z_j)] − KL(q1(z_i, z_j) || p1(z_i, z_j)) ]
  + (1 − w_ij) [ E_{z_i,z_j∼q0}[log p_θ(x_i, x_j|z_i, z_j)] − KL(q0(z_i, z_j) || p0(z_i, z_j)) ].   (7)

Learning with incomplete edge observations. In real-world scenarios, complete edge information may not be available. To allow for incomplete edge observations, we also consider (where necessary) w_ij as latent variables that need to be inferred, and define the prior for pairs without corresponding edge information as p̃(z_i, z_j, w_ij) = p(z_i, z_j|w_ij)p(w_ij), where w_ij ∼ Bernoulli(π_0), with parameter π_0 either fixed based on prior knowledge or estimated from data. For inference, we use the approximate posterior

q̃(z_i, z_j, w_ij|x_i, x_j) = q(z_i, z_j|x_i, x_j, w_ij) q(w_ij|x_i, x_j),   (8)

where q(w_ij|x_i, x_j) = Bernoulli(π_ij), with π_ij ∈ [0, 1] a neural-network-based function of the paired input {x_i, x_j}. The corresponding ELBO for incomplete observations is

L̃_{θ,φ}(x_i, x_j) = E_{q̃(z_i,z_j,w_ij|x_i,x_j)}[log p_θ(x_i, x_j|z_i, z_j, w_ij)] − KL(q̃(z_i, z_j, w_ij|x_i, x_j) || p̃(z_i, z_j, w_ij)),   (9)

and the overall training objective is

L_{θ,φ} = Σ_{{x_i,x_j,w_ij}∈D_o} L_{θ,φ}(x_i, x_j|w_ij) + Σ_{{x_i,x_j}∈D_u} L̃_{θ,φ}(x_i, x_j).   (10)
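For a pair whose edge status is unknown, sampling from the posterior in (8) is ancestral: draw w_ij ∼ Bernoulli(π_ij), then sample [z_i; z_j] from q1 or q0 accordingly. A minimal numpy sketch (illustration only; the function name and the placeholder Cholesky factors are ours, not from the paper):

```python
import numpy as np

def sample_ztilde(mu1, L1, mu0, L0, pi_ij, rng):
    # Ancestral sampling from (8): draw w_ij ~ Bernoulli(pi_ij), then draw
    # [z_i; z_j] from q1 (linked) or q0 (unlinked) via reparameterization.
    w = rng.random() < pi_ij
    mu, L = (mu1, L1) if w else (mu0, L0)
    eps = rng.standard_normal(len(mu))
    return mu + L @ eps, int(w)

rng = np.random.default_rng(0)
d = 2
mu1, mu0 = np.ones(2 * d), -np.ones(2 * d)
L1 = L0 = np.eye(2 * d)          # placeholder factors for illustration
draws = [sample_ztilde(mu1, L1, mu0, L0, 0.3, rng)[1] for _ in range(10000)]
frac_linked = np.mean(draws)     # close to pi_ij = 0.3
```

Marginalizing w_ij out analytically instead yields the two-component Gaussian mixture π_ij q1 + (1 − π_ij) q0 used in the text.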
Note that by integrating out w_ij, the approximate posterior for {z_i, z_j} is a mixture of two Gaussians; the corresponding ELBO, denoted L̃_{θ,φ}(x_i, x_j) in (9), is detailed in the SM.

VHE training. Let D_o = {x_i, x_j, w_ij | w_ij ∈ {0, 1}} and D_u = {x_i, x_j, w_ij | w_ij = N/A} be the sets of complete and incomplete observations, respectively. Our training objective is then (10), summing the complete-observation ELBO over D_o and the incomplete-observation ELBO over D_u. In practice, it is difficult to distinguish vertices with no link from those with a missing link. Hence, we propose to randomly drop a small portion, α ∈ [0, 1], of the edges with complete observations, thus treating their corresponding vertex pairs as incomplete observations. Empirically, this uncertainty modeling improves model robustness, boosting performance. Following standard practice, we draw mini-batches of data and use stochastic gradient ascent to learn the model parameters {θ, φ} with (10).

3.2 VHE for networks with textual attributes

In this section, we provide implementation details of VHE on textual networks as a concrete example.

Encoder architecture. A schematic diagram of our VHE encoder for textual networks is provided in Figure 2. Our encoder design utilizes (i) a novel phrase-to-word alignment-based text embedding module to extract context-dependent features from vertex text; (ii) a lookup-table-based structure embedding to capture topological features of vertices; and (iii) a neural integrator that combines semantic and topological features to infer the approximate posterior of the latent codes.

Phrase-to-word alignment: Given the associated text on a pair of vertices, x_i ∈ R^{d_w×L_i} and x_j ∈ R^{d_w×L_j}, where d_w is the dimension of the word embeddings and L_i and L_j are the lengths of the text sequences, we treat x_j as the context of x_i, and vice versa. Specifically, we first compute the token-wise similarity matrix M = x_i^T x_j ∈ R^{L_i×L_j}.
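The alignment module of Section 3.2 can be sketched end-to-end in a few lines. The sketch below is an illustration under one plausible reading of the row-wise convolution (full-width filters, so each of the K_r filters yields a single response per row of M); it is not the released implementation, and the function names are ours:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def p2w_embed(x_i, x_j, filt_r):
    # x_i: (d_w, L_i), x_j: (d_w, L_j) word embeddings of a vertex pair;
    # filt_r: (K_r, L_j) row-wise filters (hypothetical full-width variant).
    M = x_i.T @ x_j                  # (L_i, L_j) token-wise similarity
    F_i = np.tanh(M @ filt_r.T)      # (L_i, K_r) phrase-to-word features
    w_tilde = F_i.max(axis=1)        # max-pool over filters -> (L_i,)
    w_i = softmax(w_tilde)           # alignment weights over tokens of x_i
    return x_i @ w_i, w_i            # (d_w,) aligned text embedding

rng = np.random.default_rng(0)
d_w, L_i, L_j, K_r = 8, 5, 7, 4
x_i = rng.standard_normal((d_w, L_i))
x_j = rng.standard_normal((d_w, L_j))
filt_r = rng.standard_normal((K_r, L_j))
e_i, w_i = p2w_embed(x_i, x_j, filt_r)
```

The column-wise branch producing the embedding for x_j is symmetric, with filters spanning the rows of M.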
Next, we compute row-wise and column-wise weight vectors based on M to aggregate features for x_i and x_j. To this end, we perform 1D convolution on M both row-wise and column-wise, followed by a tanh(·) activation, to capture phrase-to-word similarities. This results in F_i ∈ R^{L_i×K_r} and F_j ∈ R^{L_j×K_c}, where K_r and K_c are the number of filters for rows and columns, respectively. We then aggregate via max-pooling on the second dimension to combine the convolutional outputs, thus collapsing them into 1D arrays, i.e., w̃_i = max-pool(F_i) and w̃_j = max-pool(F_j). After softmax normalization, we obtain the phrase-to-word alignment vectors w_i ∈ R^{L_i} and w_j ∈ R^{L_j}. The final text embeddings are given by x̃_i = x_i w_i ∈ R^{d_w}, and similarly for x̃_j. Additional details are provided in the SM.

Figure 2: Illustration of the proposed VHE encoder. See the SM for a larger version of the text embedding module.

Structure embedding and neural integrator: For each vertex v_i, we assign a d_w-dimensional learnable parameter h_i as its structure embedding, which seeks to encode the topological information of the vertex. The set of all structure embeddings H constitutes a look-up table for all vertices in G. The aligned text embeddings and structure embeddings are concatenated into a feature vector f_ij ≜ [x̃_i; x̃_j; h_i; h_j] ∈ R^{4 d_w}, which is fed into the neural integrator to obtain the posterior means (μ_i, μ_j, μ̂_i, μ̂_j), covariances (σ_i², σ_j², σ̂_i², σ̂_j²) and homophily factors γ_ij. For pairs with missing edge information, i.e., w_ij = N/A, the neural integrator also outputs the posterior probability of edge presence, i.e., π_ij. A standard multi-layer perceptron (MLP) is used for the neural integrator.

Decoder architecture. Key to the design of the decoder is the specification of a conditional likelihood model from a latent code {z_i, z_j} to an observation {x_i, x_j}.
Two choices can be considered: (i) direct reconstruction of the original text sequence (conditional multinomial likelihood), and (ii) indirect reconstruction of the text sequence in embedding space (conditional Gaussian likelihood). In practice, the direct approach typically also encodes irrelevant nuisance information [18], so we follow the indirect approach. More specifically, we use the max-pooling features

x̊_i = max-pool(x_i),  x̊_j = max-pool(x_j),   (11)

as the target representation, and let

log p_θ(x_i, x_j|z_i, z_j) = −(||x̊_i − x̂_i(z_i; θ)||² + ||x̊_j − x̂_j(z_j; θ)||²),   (12)

where x̂(z; θ) = f_θ(z) is the reconstruction of x̊ obtained by passing the posterior sample z through an MLP f_θ(·).

3.3 Inference at test time

Global network embedding. Above we have defined
a localized vertex embedding of v_i by conditioning on another vertex v_j (the context). For many network learning tasks, a global vertex embedding is desirable, i.e., one that does not condition on any specific vertex. To this end, we identify the global vertex embedding distribution by simply averaging all the pairwise local embeddings, p(z_i|X) = (1/(N−1)) Σ_{j≠i} q(z_i|x_i, x_j, w_ij), where, with a slight abuse of notation, q(z_i|x_i, x_j, w_ij) denotes the approximate posterior both with and without edge information (q̃ in (8)). In this study, we summarize the distribution via expectations, i.e., z̄_i = E[p(z_i|X)] = (1/(N−1)) Σ_{j≠i} E[q(z_i|x_i, x_j, w_ij)] ∈ R^d, which can be computed in closed form from (4). For large-scale networks, where the exact computation of z̄_i is unfeasible, we use Monte Carlo estimates obtained by subsampling {x_j}_{j≠i}.

Generalizing to unseen vertices. Unlike most existing approaches, our generative formulation generalizes to unseen vertices. Assume we have a model with learned parameters (θ̂, φ̂) and learned structure embeddings Ĥ for the vertices in the training set, hereafter collectively referred to as Θ̂. For an unseen vertex v⋆ with associated text data x⋆, we can learn its structure embedding h⋆ by optimizing it to maximize the average of the variational bounds with the learned (global) parameters fixed, i.e., J(h⋆) ≜ (1/N) Σ_{x_i∈X} L̃_{θ̂,φ̂}(x⋆, x_i | h⋆, Ĥ). Then the inference method above can be reused with the optimized h⋆ to obtain p(z⋆|X).

4 Related Work

Network embedding. Classical network embedding schemes mainly focused on the preservation of network topology (e.g., edges). For example, early developments explored direct low-rank factorization of the affinity matrix [1] or its Laplacian [6]. As an alternative to such deterministic graph factorization solutions, models such as LINE [37] employed a probabilistic formulation to account for the uncertainty of edge information.
Motivated by the success of word embeddings in NLP, DeepWalk [31] applied the skip-gram model to sampled diffusion paths, capturing the local interactions between vertices. More generally, higher-level properties such as community structure can also be preserved with specific embedding schemes [47]. Generalizations of these schemes include Node2Vec [16] and UltimateWalk [9], amongst others. Despite their widespread empirical success, limitations of such topology-only embeddings have also been recognized. In real-world scenarios, the observed edges are usually sparse relative to the number of all possible interactions, and substantial measurement error can be expected, violating the working assumptions of these models [30]. Additionally, these solutions typically cannot generalize to unseen vertices.

Fortunately, apart from the topology information, many real-world networks are also associated with rich side information (e.g., labels, texts, attributes, images, etc.), commonly known as attributes, on each vertex. The exploitation of this additional information, together with topology-based embedding, has attracted recent research interest. For example, this can be achieved by accounting for explicit vertex labels [41], or by modeling the latent topics of the vertex content [38]. Alternatively, [49] learns a topology-preserving embedding of the side information to factorize the DeepWalk diffusion matrix. To improve the flexibility of fixed-length embeddings, [40] instead treats side information as context and advocates the use of context-aware network embeddings (CANE). From a theoretical perspective, under additional technical conditions, formal inference procedures can be established for such context-dependent embeddings, guaranteeing favorable statistical properties such as uniform consistency and asymptotic normality [48].
CANE has also been further improved by using more fine-grained word-alignment approaches [34]. Notably, all the methods discussed above have almost exclusively focused on discriminative objectives. In contrast, we present a novel, fully generative model for summarizing both the topological and semantic information of a network, which shows superior performance for link prediction and better generalization to unseen vertices (see the Experiments section).
Variational Autoencoder  The VAE [21] is a powerful framework for learning stochastic representations that account for model uncertainty. While its applications have been extensively studied in the context of computer vision and NLP [44, 33, 50], its use in complex network analysis is less widely explored. Existing solutions focus on building VAEs for the generation of a graph, but not the associated content [22, 35, 26]. Such practice amounts to a variational formulation of a discriminative goal, thereby compromising more general downstream network learning tasks. To overcome these limitations, we model pairwise data rather than singletons as in the standard VAE. Recent literature has also started to explore priors other than the standard Gaussian, to improve model flexibility [12, 39, 46, 45] or to enforce structural knowledge [2]. In our case, we propose a novel homophilic prior that exploits the correlation between the latent representations of connected vertices.
5 Experiments
We evaluate the proposed VHE on link prediction and vertex classification tasks. Our code is available from https://github.com/Wenlin-Wang/VHE19.
Datasets  Following [40], we consider three widely studied real-world network datasets: CORA [28], HEPTH [25], and ZHIHU¹. CORA and HEPTH are citation network datasets, and ZHIHU is a network derived from the largest Q&A website in China. Summary statistics for these datasets are provided in Table 1. To make direct comparison with existing work, we adopt the same pre-processing steps described in [40, 34]. Details of the experimental setup are found in the SM.

Table 1: Summary of datasets used in evaluation.
Datasets  #vertices  #edges  %sparsity  #labels
CORA        2,277     5,214    0.10%       7
HEPTH       1,038     1,990    0.18%       -
ZHIHU      10,000    43,894    0.04%       -

¹https://www.zhihu.com/

Evaluation metrics  The AUC score [17] is employed as the evaluation metric for link prediction. For vertex classification, we follow [49] and build a linear SVM [14] on top of the learned network embedding to predict the label of each vertex. Various training ratios are considered, and for each we repeat the experiment 10 times and report the mean score and the standard deviation.
Baselines  To demonstrate the effectiveness of VHE, three groups of network embedding approaches are considered: (i) structure-only methods, including MMB [3], LINE [37], Node2Vec [16] and DeepWalk [31]; (ii) approaches that utilize both structural and semantic information, including TADW [49], CENE [36], CANE [40] and WANE [34]; (iii) VAE-based generative approaches, including the naïve variational solution discussed in Section 3, using Node2Vec [16] as the off-the-shelf structural embedding (Naïve-VAE), and VGAE [22]. To verify the effectiveness of the proposed generative objective and phrase-to-word alignment, a baseline model employing the same textual embedding as VHE, but with a discriminative objective [40], is also considered, denoted PWA. For vertex classification, we further compare with DMTE [51].
5.1 Results and analysis

Table 2: AUC scores for link prediction on three benchmark datasets. Each experiment is repeated 10 times; the standard deviations, together with more detailed results, are provided in the SM.

                        CORA                          HEPTH                         ZHIHU
% Train Edges   15%  35%  55%  75%  95%     15%  35%  55%  75%  95%     15%  35%  55%  75%  95%
MMB [3]        54.7 59.5 64.9 71.1 75.9    54.6 57.3 66.2 73.6 80.3    51.0 53.7 61.6 68.8 72.4
LINE [37]      55.0 66.4 77.6 85.6 89.3    53.7 66.5 78.5 87.5 87.6    52.3 59.9 64.3 67.7 71.1
Node2Vec [16]  55.9 66.1 78.7 85.9 88.2    57.1 69.9 84.3 88.4 89.2    54.2 57.3 58.7 66.2 68.5
DeepWalk [31]  56.0 70.2 80.1 85.3 90.3    55.2 70.0 81.3 87.6 88.0    56.6 60.1 61.8 63.3 67.8
TADW [49]      86.6 90.2 90.0 91.0 92.7    87.0 91.8 91.1 93.5 91.7    52.3 55.6 60.8 65.2 69.0
CENE [36]      72.1 84.6 89.4 93.9 95.9    86.2 89.8 92.3 93.2 93.2    56.8 60.3 66.3 70.2 73.8
CANE [40]      86.8 92.2 94.6 95.6 97.7    90.0 92.0 94.2 95.4 96.3    56.8 62.9 68.9 71.4 75.4
WANE [34]      91.7 94.1 96.2 97.5 99.1    92.3 95.7 97.5 97.7 98.7    58.7 68.3 74.9 79.7 82.6
Naïve-VAE      60.2 67.8 80.2 87.7 90.1    60.8 68.1 80.7 88.8 90.5    56.5 60.2 62.5 68.1 69.0
VGAE [22]      63.9 74.3 84.3 88.1 90.5    65.5 74.5 85.9 88.4 90.4    55.9 61.9 64.6 70.1 71.2
PWA            92.2 95.6 96.8 97.7 98.9    92.8 96.1 97.6 97.9 99.0    62.6 70.8 77.1 80.8 83.3
VHE            94.4 97.6 98.3 99.0 99.4    94.1 97.5 98.3 98.8 99.4    66.8 74.1 81.6 84.7 86.4

Link Prediction  Given the network, various ratios of observed edges are used for training and the rest are used for testing; the goal is to predict the missing edges. Results are summarized in Table 2 (more comprehensive results can be found in the SM). Several observations can be made: (i) Semantic-aware methods are consistently better than approaches that only use structural information, indicating the importance of incorporating the associated text sequences into network embeddings. (ii) Comparing PWA with CANE [40] and WANE [34], the proposed phrase-to-word alignment performs better than competing textual embedding methods. (iii) Naïve VAE solutions are less effective: semantic information extracted with a standard VAE (Naïve-VAE) provides only incremental improvements relative to structure-only approaches, while VGAE [22] neglects semantic information, cannot scale to large datasets, and its performance is also subpar. (iv) VHE achieves consistently superior performance on all three datasets across different missing-data levels, suggesting that VHE is an effective solution for learning network embeddings, especially when the network is sparse and large. On the largest dataset, ZHIHU, VHE achieves an average improvement of 5.9 AUC points relative to the prior state of the art, WANE [34].

Table 3: Test accuracy for vertex classification on the CORA dataset.
% of Labeled Data  10%   30%   50%   70%
DeepWalk [31]      50.8  54.5  56.5  57.7
LINE [37]          53.9  56.7  58.8  60.1
CANE [40]          81.6  82.8  85.2  86.3
TADW [49]          71.0  71.4  75.9  77.2
WANE [34]          81.9  83.9  86.4  88.1
DMTE [51]          81.8  83.9  86.4  88.1
PWA                82.1  83.8  86.7  88.2
VHE                82.6  84.3  87.7  88.5

Vertex Classification  The effectiveness of the learned network embedding is further investigated via vertex classification. Similar to [40], learned embeddings are saved and then an SVM is built to predict the label of each vertex. Both quantitative and qualitative results are provided, with the former shown in Table 3.
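The classification protocol (fixed embeddings, a varying labeled ratio, 10 repeated splits with mean/std reported) can be sketched as below. As a dependency-free stand-in for the linear SVM [14] used in the paper, a nearest-class-mean classifier is substituted here, and the toy two-cluster embeddings are illustrative.

```python
import numpy as np

def eval_classification(emb, labels, ratio, repeats=10, seed=0):
    """For a given labeled-data ratio, repeatedly split vertices into
    train/test, fit a classifier on the training embeddings, and report
    mean/std test accuracy. A nearest-class-mean classifier stands in
    for the linear SVM used in the paper."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    accs = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        n_train = max(1, int(ratio * n))
        tr, te = idx[:n_train], idx[n_train:]
        # Fit: one mean embedding (centroid) per class seen in training.
        classes = np.unique(labels[tr])
        centroids = np.stack([emb[tr][labels[tr] == c].mean(axis=0) for c in classes])
        # Predict: assign each test vertex to the nearest class centroid.
        d = ((emb[te][:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        pred = classes[d.argmin(axis=1)]
        accs.append((pred == labels[te]).mean())
    return float(np.mean(accs)), float(np.std(accs))

# Toy embeddings: two well-separated Gaussian clusters of 50 vertices each.
rng = np.random.default_rng(1)
emb = np.concatenate([rng.normal(0, 0.1, (50, 8)), rng.normal(3, 0.1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)
mean_acc, std_acc = eval_classification(emb, labels, ratio=0.5)
```

Keeping the embeddings fixed while only the downstream classifier is retrained is what makes this a probe of embedding quality rather than of classifier capacity.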
Similar to link prediction, semantic-aware approaches, e.g., CENE [36], CANE [40] and WANE [34], provide better performance than structure-only approaches. Furthermore, VHE outperforms the other strong baselines as well as our PWA model, indicating that VHE best leverages the structural and semantic information, resulting in robust network embeddings. As qualitative analysis, we use t-SNE [27] to visualize the learned embeddings, as shown in Figure 4(a): vertices belonging to different classes are well separated from each other.
When does VHE work?  VHE produces state-of-the-art results, and a natural follow-up question is when VHE works better than previous discriminative approaches. Intuitively, VHE imposes strong structural constraints, which could lead to more robust estimation, especially when vertex connections are sparse. To validate this hypothesis, we design the following experiment on ZHIHU. When evaluating the model, we separate the testing vertices into quantiles based on the number of edges of each vertex (its degree), and compare VHE against PWA on each group. Results are summarized in Figure 3. VHE improves link prediction for all groups of vertices, and the gain is especially large when interactions between vertices are rare, evidence that the proposed structural prior is a reasonable assumption and provides robust learning of network embeddings. It is also interesting that prediction accuracy on groups with rare connections is no worse than on those with dense connections.
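The degree-quantile analysis can be sketched as follows, with AUC computed from the Mann-Whitney rank statistic. The per-vertex score arrays and the quantile grouping are simplified assumptions; in practice each vertex contributes a variable number of held-out edges and sampled non-edges.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC via the Mann-Whitney U statistic: the probability that a random
    positive (observed) edge scores higher than a random negative one,
    counting ties as half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def auc_by_degree_quantile(degrees, pos_scores, neg_scores, n_groups=4):
    """Group test vertices into degree quantiles and report per-group AUC,
    mirroring the sparse-vs-dense comparison of Figure 3. For simplicity,
    one positive and one negative score are assumed per vertex."""
    qs = np.quantile(degrees, np.linspace(0, 1, n_groups + 1))
    out = []
    for lo, hi in zip(qs[:-1], qs[1:]):
        mask = (degrees >= lo) & (degrees <= hi)
        out.append(auc(pos_scores[mask], neg_scores[mask]))
    return out

# Toy data: 8 test vertices with varying degrees.
degrees = np.array([1, 1, 2, 2, 5, 5, 9, 9])
pos = np.array([.9, .8, .7, .9, .6, .8, .9, .7])  # scores of held-out true edges
neg = np.array([.2, .3, .1, .4, .3, .2, .5, .1])  # scores of sampled non-edges
per_group = auc_by_degree_quantile(degrees, pos, neg)
```

Computing AUC within each degree bucket, rather than pooled, is what isolates whether gains come from sparsely or densely connected vertices.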
One possible explanation is that the semantic information associated with users with rare connections is more reflective of their true interests, hence it can be used to infer connections accurately, whereas such semantic information can be noisy for active users with dense connections.
Link prediction on unseen vertices  VHE can be further extended to learn embeddings for unseen vertices, a setting that has not been well studied previously. To investigate this, we split the vertices into training/testing sets with various ratios, and report link prediction results on the unseen (testing) vertices. To evaluate the generalization ability of previous discriminative approaches to unseen vertices, two variants each of CANE [40] and WANE [34] are considered as baselines. (i) The first method ignores the structure embedding and relies purely on the semantic textual information to infer edges, and therefore directly extends to unseen vertices (marked by †). (ii) The second approach learns an additional mapping from the semantic embedding to the structure embedding with an MLP during training; when testing on unseen vertices, it first infers the structure embedding from the semantic embedding, and then combines the two to predict the existence of links (marked by ‡). Results are provided in Table 4. Consistent with previous results, semantic information is useful for link prediction: even when the fraction of observed vertices is small, e.g., 15%, VHE and the other semantic-aware approaches predict links reasonably well. Further, VHE consistently outperforms PWA, showing that the proposed variational approach yields better generalization to unseen vertices than discriminative models.
Figure 3: AUC as a function of vertex degree (quantiles).
Error bars represent the standard deviation.

Table 4: AUC scores for link prediction in the setting with unseen vertices. † denotes approaches using semantic features only; ‡ denotes methods using both semantic and structure features, where the structure features are inferred from the semantic features with a one-layer MLP.

                        CORA                          HEPTH                         ZHIHU
% Train Vertices 15%  35%  55%  75%  95%    15%  35%  55%  75%  95%     15%  35%  55%  75%  95%
CANE†           84.2 88.0 91.2 93.6 94.7   83.4 87.9 91.1 93.8 95.1    55.9 62.1 67.3 73.3 76.2
CANE‡           83.8 88.0 91.0 93.7 95.0   83.1 86.8 90.4 93.9 95.2    56.0 61.5 66.9 73.5 76.3
WANE†           86.6 88.4 92.8 93.8 95.2   87.4 88.7 92.2 94.2 95.7    57.6 65.1 71.2 76.6 79.9
WANE‡           86.9 88.3 92.8 94.1 95.3   87.0 88.8 92.5 95.4 95.7    57.8 65.2 70.8 76.5 80.2
PWA†            87.2 90.2 93.1 95.2 96.1   87.7 89.9 93.5 95.7 95.9    61.5 74.7 77.3 81.0 82.3
PWA‡            87.4 90.5 93.0 95.5 96.2   87.8 90.1 93.3 95.8 96.0    62.0 75.0 77.4 80.9 82.4
VHE             90.2 92.6 94.8 96.6 97.7   89.9 92.4 95.0 96.9 97.4    63.2 75.6 78.0 81.3 82.7

Figure 4: (a) t-SNE visualization of the learned network embeddings on the CORA dataset; labels are color coded. (b, c) Sensitivity analysis of the homophily factor and the dropout ratio α, respectively. (d) Ablation study on the use of the structure embedding in the encoder. Results are reported using 95% training edges on the three datasets.
5.2 Ablation study
Sensitivity analysis  The homophily factor controls the strength of the linking information. To analyze its impact on the performance of VHE, we conduct experiments with 95% training edges on the CORA dataset. As observed in Figure 4(b), a larger homophily factor is empirically preferred. This is intuitive: the ultimate goal is to predict structural information, and VHE incorporates such information in the prior design. When the homophily factor is large, the structural information plays a more important role in the objective, and the optimization of the ELBO in (7) will seek to accommodate it. It is also interesting to note that VHE performs well even when the homophily factor is set to zero. In this case, embeddings are inferred purely from the semantic features learned by our model, and such semantic information may have strong correlations with the structural information.
In Figure 4(c), we further investigate the sensitivity of our model to the dropout ratio α. With a small dropout ratio (0 < α ≤ 0.4), we observe consistent improvements over the no-dropout baseline (α = 0) across all datasets, demonstrating the effectiveness of uncertainty estimation for link prediction. Even when the dropout ratio is α = 1.0, performance does not drop dramatically. We hypothesize that this is because VHE is able to discover the underlying missing edges, given our homophilic prior design.
Structure embedding  Our encoder produces both semantic and structure-based embeddings for each vertex; here we analyze the impact of the structure embedding. Experiments with and without structure embeddings are performed on the three datasets, with results shown in Figure 4(d). Without the structure embedding, performance remains almost the same on the ZHIHU dataset, but the AUC score drops about 2 points on the other two datasets. The impact of the structure embedding thus varies across datasets: the semantic information in CORA and HEPTH may not fully reflect the structural information, e.g., documents with similar semantic content do not necessarily cite each other.
6 Conclusions
We have presented Variational Homophilic Embedding (VHE), a novel method to characterize relationships between vertices in a network. VHE learns informative and robust network embeddings by leveraging semantic and structural information.
Additionally, a powerful phrase-to-word alignment approach is introduced for textual embedding. Comprehensive experiments have been conducted on link prediction and vertex classification tasks, and state-of-the-art results are achieved. Moreover, we provide insights into the benefits brought by VHE when compared with traditional discriminative models. It is of interest to investigate the use of VHE in more complex scenarios, such as learning node embeddings for graph matching problems.
Acknowledgement: This research was supported in part by DARPA, DOE, NIH, ONR and NSF.
References
[1] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J Smola. Distributed large-scale natural graph factorization. In WWW, 2013.
[2] Samuel Ainsworth, Nicholas Foti, Adrian KC Lee, and Emily Fox. Interpretable VAEs for nonlinear group factor analysis. arXiv preprint arXiv:1802.06765, 2018.
[3] Edoardo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. Mixed membership stochastic blockmodels. JMLR, 2008.
[4] Albert-László Barabási et al. Network science. Cambridge University Press, 2016.
[5] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[6] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NeurIPS, 2002.
[7] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.
[8] Jifan Chen, Qi Zhang, and Xuanjing Huang.
Incorporate group information to enhance network embedding. In KDD, 2016.
[9] Siheng Chen, Sufeng Niu, Leman Akoglu, Jelena Kovačević, and Christos Faloutsos. Fast, warped graph embedding: Unifying framework and one-click algorithm. arXiv preprint arXiv:1702.05764, 2017.
[10] Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering, 2018.
[11] Dariusz Dereniowski and Marek Kubale. Cholesky factorization of matrices in parallel and ranking of graphs. In PPAM, 2003.
[12] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
[13] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[14] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. JMLR, 2008.
[15] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer Series in Statistics, New York, 2001.
[16] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
[17] James A Hanley and Barbara J McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.
[18] Alon Jacovi, Oren Sar Shalom, and Yoav Goldberg. Understanding convolutional neural networks for text classification. arXiv preprint arXiv:1809.08037, 2018.
[19] Tony Jebara. Machine learning: discriminative and generative, volume 755. Springer Science & Business Media, 2012.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] Diederik P Kingma and Max Welling.
Auto-encoding variational Bayes. In ICLR, 2014.
[22] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[23] Tuan MV Le and Hady W Lauw. Probabilistic latent document network embedding. In ICDM, 2014.
[24] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. PyTorch-BigGraph: A large-scale graph embedding system. arXiv preprint arXiv:1903.12287, 2019.
[25] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD, 2005.
[26] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Constrained graph variational autoencoders for molecule design. In NeurIPS, 2018.
[27] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
[28] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 2000.
[29] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 2001.
[30] MEJ Newman. Network structure from rich but noisy data. Nature Physics, 2018.
[31] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In KDD, 2014.
[32] Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 2008.
[33] Dinghan Shen, Qinliang Su, Paidamoyo Chapfuwa, Wenlin Wang, Guoyin Wang, Lawrence Carin, and Ricardo Henao. NASH: Toward end-to-end neural architecture for generative semantic hashing. arXiv preprint arXiv:1805.05361, 2018.
[34] Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. Improved semantic-aware network embedding with fine-grained word alignment.
In EMNLP, 2018.
[35] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In ICANN, 2018.
[36] Xiaofei Sun, Jiang Guo, Xiao Ding, and Ting Liu. A general framework for content-enhanced network representation learning. arXiv preprint arXiv:1610.02906, 2016.
[37] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In WWW, 2015.
[38] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In KDD, 2009.
[39] Jakub M Tomczak and Max Welling. VAE with a VampPrior. In AISTATS, 2018.
[40] Cunchao Tu, Han Liu, Zhiyuan Liu, and Maosong Sun. CANE: Context-aware network embedding for relation modeling. In ACL, 2017.
[41] Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, Maosong Sun, et al. Max-margin DeepWalk: Discriminative learning of network representation. In IJCAI, 2016.
[42] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In KDD, 2016.
[43] Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. Joint embedding of words and labels for text classification. In ACL, 2018.
[44] Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. Topic compositional neural language model. arXiv preprint arXiv:1712.09783, 2017.
[45] Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. Topic-guided variational autoencoders for text generation. In NAACL, 2019.
[46] Wenlin Wang, Yunchen Pu, Vinay Kumar Verma, Kai Fan, Yizhe Zhang, Changyou Chen, Piyush Rai, and Lawrence Carin. Zero-shot learning via class-conditioned deep generative models. In AAAI, 2018.
[47] Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. Community preserving network embedding.
In AAAI, 2017.
[48] Ting Yan, Binyan Jiang, Stephen E Fienberg, and Chenlei Leng. Statistical inference in a directed network model with covariates. JASA, 2018.
[49] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Chang. Network representation learning with rich text information. In IJCAI, 2015.
[50] Qian Yang, Zhouyuan Huo, Dinghan Shen, Yong Cheng, Wenlin Wang, Guoyin Wang, and Lawrence Carin. An end-to-end generative architecture for paraphrase generation. In EMNLP, 2019.
[51] Xinyuan Zhang, Yitong Li, Dinghan Shen, and Lawrence Carin. Diffusion maps for textual network embedding. In NeurIPS, 2018.