{"title": "Scalable Deep Generative Relational Model with High-Order Node Dependence", "book": "Advances in Neural Information Processing Systems", "page_first": 12658, "page_last": 12668, "abstract": "In this work, we propose a probabilistic framework for relational data modelling and latent structure exploration. Given possible feature information for the nodes in a network, our model builds a deep architecture that can approximate the potentially nonlinear mappings between the nodes' feature information and their latent representations. For each node, we incorporate its neighbourhoods' high-order structure information to generate its latent representation, such that these latent representations are ``smooth'' over the network. Since the latent representations are generated from Dirichlet distributions, we further develop a data augmentation trick to enable efficient Gibbs sampling for the Ber-Poisson likelihood with Dirichlet random variables. Our model is readily applicable to large sparse networks, as its computational cost scales with the number of positive links in the network. The superior performance of our model is demonstrated through improved link prediction on a range of real-world datasets.", "full_text": "Scalable Deep Generative Relational Models\n\nwith High-Order Node Dependence\n\nXuhui Fan1, Bin Li2, Scott A. Sisson1, Caoyuan Li3, and Ling Chen3\n\n1School of Mathematics & Statistics, University of New South Wales, Sydney\n2Shanghai Key Lab of IIP & School of Computer Science, Fudan University\n\n3Faculty of Engineering and IT, University of Technology, Sydney\n\n{xuhui.fan, scott.sisson}@unsw.edu.au; libin@fudan.edu.cn\n\nAbstract\n\nWe propose a probabilistic framework for modelling and exploring the latent structure of relational data. 
Given feature information for the nodes in a network, the scalable deep generative relational model (SDREM) builds a deep network architecture that can approximate potential nonlinear mappings between nodes' feature information and the nodes' latent representations. Our contribution is two-fold: (1) We incorporate high-order neighbourhood structure information to generate the latent representations at each node, which vary smoothly over the network. (2) Due to the Dirichlet random variable structure of the latent representations, we introduce a novel data augmentation trick which permits efficient Gibbs sampling. The SDREM can be used for large sparse networks as its computational cost scales with the number of positive links. We demonstrate its competitive performance through improved link prediction performance on a range of real-world datasets.\n\n1 Introduction\n\nBayesian relational models, which describe the pairwise interactions between nodes in a network, have gained tremendous attention in recent years, with numerous methods developed to model the complex dependencies within relational data; in particular, probabilistic Bayesian methods [27, 18, 1, 25, 7, 6]. 
Such models have been applied to community detection [27, 17], collaborative filtering [29, 23], knowledge graph completion [14] and protein-to-protein interactions [16]. In general, the goal of these Bayesian relational models is to discover the complex latent structure underlying the relational data and predict the unknown pairwise links [9, 8].\nDespite improving the understanding of complex networks, existing models typically have one or more weaknesses: (1) While data commonly exhibit high-order node dependencies within the network, such dependencies are rarely modelled due to limited model capabilities; (2) Although a node's feature information closely informs its latent representation, existing models are not sufficiently flexible to describe these (potentially nonlinear) mappings well; (3) While some scalable network modelling techniques (e.g. Ber-Poisson link functions [30, 36]) can help to reduce the computational complexity to the number of positive links, they require the elements of latent representations to be independently generated and cannot be used for modelling dependent variables (e.g. membership distributions on communities).\nIn order to address these challenges, we develop a probabilistic framework using a deep network architecture on the nodes to model the relational data. The proposed scalable deep generative relational model (SDREM) builds a deep network architecture to efficiently map the nodes' feature information to their latent representations. In particular, the latent representations are modelled via Dirichlet distributions, which permits their interpretation as membership distributions on communities.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nBased on the output latent representations (i.e. 
membership distributions) and an introduced community compatibility matrix, the relational data is modelled through the Ber-Poisson link function [30, 36], for which the computational cost scales with the number of positive links in the network.\nWe make two novel contributions: First, as the nodes' latent representations are Dirichlet random variables, we incorporate the full neighbourhood's structure information into their concentration parameters. In this way, high-order node dependence can be modelled well and can vary smoothly over the network. Second, we introduce a new data augmentation trick that enables efficient Gibbs sampling on the Ber-Poisson link function due to the Dirichlet random variable structure of the latent representations. The SDREM can be used to analyse large sparse networks and may also be directly applied to other notable models to improve their scalability (e.g. the mixed-membership stochastic blockmodel (MMSB) [1] and its variants [22, 13, 19]).\nIn comparison to existing approaches, the SDREM has several advantages. (1) Modelling high-order node dependence: Propagating information between nodes' connected neighbourhoods can improve information sharing and dependence modelling between nodes. Also, it can largely reduce computational costs in contrast to considering all the pairwise nodes' dependence, as well as avoid spurious or redundant information complications from unrelated nodes. Moreover, the non-linear real-value propagation in the deep network architecture can help to approximate the complex nonlinear mapping between the node's feature information and its latent representations. (2) Scalable modelling of relational data: Our novel data augmentation trick permits an efficient Gibbs sampling implementation, with computational costs scaling with the number of positive network links only. 
(3) Meaningful layer-wise latent representation: Since the nodes' latent representations are generated from Dirichlet distributions, they are naturally interpretable as the nodes' memberships over latent communities.\nIn our analyses on a range of real-world relational datasets, we demonstrate that the SDREM can achieve superior performance compared to traditional Bayesian methods for relational data, and perform competitively with other approaches. As the SDREM is the first Bayesian relational model to use neighbourhood-wise propagation to build the deep network architecture, we note that it may straightforwardly integrate other Bayesian methods for modelling high-order node dependencies in relational data, and further improve relationship predictability.\n\n2 Scalable Deep Generative Relational Models (SDREMs)\n\nThe relational data in the SDREM is represented as a binary matrix $R \in \{0,1\}^{N \times N}$, where $N$ is the number of nodes and the element $R_{ij}$ ($\forall i,j$) indicates whether node $i$ relates to node $j$ ($R_{ij} = 1$ if the relation exists, otherwise $R_{ij} = 0$); the self-connection relation $R_{ii}$ is not considered here. The matrix $R$ can be symmetric (i.e. undirected) or asymmetric (i.e. directed). The network's feature information is denoted by a non-negative matrix $F \in \{\mathbb{R}^+ \cup 0\}^{N \times D}$, where $D$ denotes the number of features, and where each element $F_{id}$ ($\forall i,d$) takes the value of the $d$-th feature for the $i$-th node. The deep network architecture of the SDREM is controlled by two parameters: $L$, representing the number of layers, and $K$, denoting the length of the nodes' latent representation in each layer. The latent representation $\pi^{(l)}_i$ of node $i$ in the $l$-th layer is a Dirichlet random variable (i.e. a normalised vector with $(K-1)$ active elements). 
In this way, $\pi^{(l)}_i$, which we term the "membership distribution", is interpretable as node $i$'s community distribution, where $K$ communities are modelled and $\pi^{(l)}_{ik}$ denotes node $i$'s interaction with the $k$-th community in the $l$-th layer.\nThe deep network architecture of the SDREM is composed of three parts: (1) The input layer feeding the feature information; (2) The hidden layers modelling high-order node dependences; (3) The output layer of the relational data model. These component parts are detailed below.\n\nFigure 1: Illustration and visualization of a SDREM on a 5-node (i.e. A, B, C, D, E) directed network. Left: the graphical model of a 3-layer SDREM modelling $R_{BA}, R_{ED}$. Shaded nodes (i.e. $F_{\cdot}, R_{\cdot}$) denote variables with known values, unshaded nodes denote latent variables. Right top: the generative process of a SDREM. Right bottom: the directed connection types of all 5 nodes.\n\n2.1 Feeding the feature information\n\nWhen nodes' feature information is available, we introduce a feature-to-community transition coefficient matrix $T \in (\mathbb{R}^+)^{D \times K}$, where $T_{dk}$ indicates the activity of the $d$-th feature in contributing to the $k$-th latent community. The linear sum of the transition coefficients $T$ and features $F$ forms the prior for the nodes' first-layer membership distributions:\n\n$T_{dk} \sim \mathrm{Gam}(\gamma^{(1)}_d, \frac{1}{c^{(1)}})$ $\forall d,k$; $\quad \pi^{(1)}_i \sim \mathrm{Dirichlet}(F_iT + \alpha)$ $\forall i$, (1)\n\nwhere $\mathrm{Gam}(\gamma, 1/c)$ denotes a gamma random variable with mean $\gamma/c$ and variance $\gamma/c^2$; $\{\gamma^{(1)}_d\}_d$ and $c^{(1)}$ are the hyper-parameters for generating $\{T_{dk}\}_{d,k}$. From Eq. (1), nodes with close feature information have similar prior knowledge and hence similar generated membership distributions. A supplementary contribution $\alpha$ is included in case a node has no feature information available. For node $i$ without feature information, we have $\pi^{(1)}_i \sim \mathrm{Dirichlet}(\alpha \cdot \mathbf{1}_{1 \times K})$, which is a common setting in Bayesian relational data modelling.\n\n2.2 Modelling high-order node dependence\n\nHigh-order node dependence is modelled within the deep network architecture of the SDREM. In general, node $i$'s membership distribution $\pi^{(l)}_i$ is conditioned on the membership distributions at the $(l-1)$-th layer via an information propagation matrix $B^{(l-1)} \in \{\mathbb{R}^+ \cup 0\}^{N \times N}$:\n\n$B^{(l-1)}_{i'i} \sim \mathrm{Gam}(\gamma^{(l)}_1, \frac{1}{c^{(l)}})$ if $R_{i'i} = 1$; $\quad B^{(l-1)}_{i'i} \sim \mathrm{Gam}(\gamma^{(l)}_0, \frac{1}{c^{(l)}})$ if $i' = i$; $\quad B^{(l-1)}_{i'i} = 0$ otherwise; $\quad \pi^{(l)}_i \sim \mathrm{Dirichlet}((B^{(l-1)}_{\cdot i})^{\top} \cdot \pi^{(l-1)}_{1:N})$. (2)\n\nFollowing [35], we set the hyper-parameter distributions as $\gamma^{(l)}_1, \gamma^{(l)}_0 \sim \mathrm{Gam}(e^{(l)}_0, \frac{1}{f^{(l)}_0})$ and $c^{(l)} \sim \mathrm{Gam}(g_0, \frac{1}{h_0})$. $B^{(l-1)}_{i'i}$ denotes node $i'$'s influence on node $i$ from the $(l-1)$-th to the $l$-th layer (e.g. larger values of $B^{(l-1)}_{i'i}$ will make $\pi^{(l)}_i$ more similar to $\pi^{(l-1)}_{i'}$), and $\pi^{(l)}_{1:N} \in \{\mathbb{R}^+\}^{N \times K}$ denotes the matrix of the $N$ nodes' membership distributions at the $l$-th layer. When there is no direct connection from node $i'$ to node $i$ (i.e. $i' \neq i$ and $R_{i'i} = 0$), we restrict the corresponding information propagation coefficients $B_{i'i}$ at all layers to be $0$; otherwise, we generate $B^{(l-1)}_{i'i}$ either from a node- and layer-specific Gamma distribution (when $R_{i'i} = 1$) or a layer-specific Gamma distribution (when $i' = i$). This can produce various benefits. 
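Before turning to those benefits, the layer-wise sampling in Eqs. (1)-(2) can be illustrated with a short numpy sketch. This is an illustration only, with made-up toy shapes and hyper-parameter values; the helper `propagate_layer` is ours, not from the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, L = 5, 3, 3                         # nodes, communities, layers (toy sizes)

R = rng.integers(0, 2, size=(N, N))       # observed binary adjacency matrix
np.fill_diagonal(R, 0)

# B[i_, i] may be positive only when R[i_, i] == 1 or i_ == i (Eq. 2);
# all remaining entries are fixed to zero, which keeps B sparse.
mask = (R == 1) | np.eye(N, dtype=bool)
B = np.where(mask, rng.gamma(shape=1.0, scale=1.0, size=(N, N)), 0.0)

def propagate_layer(pi_prev, B, rng):
    """pi_i^(l) ~ Dirichlet((B[:, i])^T pi^(l-1)): a Dirichlet draw whose
    concentration is the linear sum of the neighbours' memberships."""
    conc = B.T @ pi_prev                  # (N, K) concentration parameters
    return np.vstack([rng.dirichlet(c) for c in conc])

pi = rng.dirichlet(np.ones(K), size=N)    # first-layer memberships (cf. Eq. 1)
for _ in range(L - 1):
    pi = propagate_layer(pi, B, rng)

print(np.round(pi.sum(axis=1), 6))        # each row is still a distribution
```

Restricting $B$ to the observed edges (plus self-loops) is what keeps the per-layer propagation cost proportional to the number of positive links rather than $O(N^2)$.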
On one hand, it promotes the sparseness of $B^{(l)}$ and reduces the cost of calculating $B^{(l)}$ from $O(N^2)$ to the scale of the number of positive network links. On the other hand, since the SDREM uses a Dirichlet distribution (parameterised by the linear sum of node $i$'s neighbourhoods' membership distributions at the $(l-1)$-th layer) to generate $\pi^{(l)}_i$, all the nodes' membership distributions are expected to vary smoothly over the connected graph structure. That is, connected nodes are expected to have more similar membership distributions than unconnected ones.\nFlexibility in modelling variance and covariance in membership distributions: Neighbourhood-wise information propagation allows for more flexible modelling than the extreme case of independent propagation, whereby $\pi^{(l)}_i$ is conditioned on $\pi^{(l-1)}_i$ only (i.e. each $B^{(l)}$ is a diagonal matrix). Under independent propagation, the expected membership distribution at each layer does not change: $E[\pi^{(l)}_{1:N}] = [\prod_{l'=1}^{l-1} (D^{(l')})^{-1} (B^{(l')})^{\top}] \pi^{(1)}_{1:N} = \pi^{(1)}_{1:N}$, where $D^{(l)}$ is a level-$l$ diagonal matrix with $D^{(l)}_{ii} = \sum_{i'} B^{(l)}_{i'i}$, $\forall i$. Based on different choices for $\{B^{(l)}\}_l$, the expected mean of each node's membership distribution can incorporate information from the other nodes' input layers. In terms of variance and covariance within each $\pi^{(l)}_i$, independent propagation is restricted to inducing a larger variance in $\pi^{(l)}_{ik}$ and a smaller covariance between $\pi^{(l)}_{ik_1}$ and $\pi^{(l)}_{ik_2}$ due to the layer-stacking architecture (this can be easily verified through the law of total variance and the law of total covariance). 
In contrast, for the SDREM, these variances and covariances can be made either large or small depending on the choices of $\{B^{(l)}\}_l$ through the deep network architecture.\nThe Dirichlet distribution models the membership distributions $\{\pi^{(l)}_i\}_{i,l}$ in a non-linear way. As non-linearities are easily captured via deep learning, it is expected that the deep network architecture in the SDREM can approximate the complex nonlinear mapping between the nodes' feature information and membership distributions sufficiently well. Further, the technique of propagating real-valued distributions through different layers might be a promising alternative to sigmoid belief networks [10, 11, 15], which mainly propagate binary variables between different layers.\nComparison with spatial graph convolutional networks: Propagating information through neighbourhoods works in a similar spirit to the spatial graph convolutional network (GCN) [2, 5, 12, 3] in a frequentist setting. In addition to providing variability estimates for all latent variables and predictions, the SDREM may conveniently incorporate beliefs on the parameters and exploit the rich structure within the data. Beyond the likelihood function, the SDREM uses a Dirichlet distribution as the activation function, whereas GCN algorithms usually use the logistic function. The resulting membership distribution representation of the SDREM may provide a more intuitive interpretation than the node representation (node embedding) in the GCN.\n\n2.3 Scalable relational data modelling\n\nWe model the final-layer relational data via the Ber-Poisson link function [30, 36], $R_{ij} \sim \mathrm{Bernoulli}(1 - e^{-\sum_{k_1,k_2} X_{ik_1} \Lambda_{k_1k_2} X_{jk_2}})$, where $X_{ik}$ is the latent count of node $i$ on community $k$ and $\Lambda_{k_1k_2} \in \mathbb{R}^+$ is a compatibility value between communities $k_1$ and $k_2$. 
In existing work with the Ber-Poisson link function, all of the $\{X_{ik}\}_{i,k}$ terms are required to be independently generated (either from a Gamma [36, 34] or a Bernoulli distribution [15]) to allow for efficient Gibbs sampling. However, in the SDREM, the elements of the output latent representation $(\pi_{i1}, \ldots, \pi_{iK})$ are jointly generated from a Dirichlet distribution. These normalised elements are dependent on each other, and it is not easy to enable Gibbs sampling for each individual element $\{\pi_{ik}\}_k$.\nTo address this problem, we use a decomposition strategy to isolate the elements $\{\pi_{ik}\}_k$. We use multinomial distributions, with $\{\pi_i\}_i$ as event probabilities, to generate $K$-length counting vectors $\{X_i\}_i$. Each $X_i$ can be regarded as an estimator of $\pi_i$. Since the sum of the $\{X_{ik}\}_k$ is fixed as the number of trials (denoted $M_i$) in the multinomial distribution, we further let $M_i$ be generated as $M_i \sim \mathrm{Poisson}(M)$. Based on the Poisson-Multinomial equivalence [4], each $X_{ik}$ is then equivalently distributed as $X_{ik} \sim \mathrm{Poisson}(M\pi_{ik})$.\nFollowing the settings of the Ber-Poisson link function, a latent integer matrix $Z_{ij} \in \mathbb{N}^{K \times K}$ is introduced, where the $(k_1,k_2)$-th entry is $Z_{ij,k_1k_2} \sim \mathrm{Poisson}(X_{ik_1}\Lambda_{k_1k_2}X_{jk_2})$. $R_{ij}$ is then generated by evaluating the positivity of the matrix $Z_{ij}$. That is, $\forall (i,j), k_1, k_2$:\n\n$M_i \sim \mathrm{Poisson}(M)$; $\quad (X_{i1}, \ldots, X_{iK}) \sim \mathrm{Multi}(M_i; \pi^{(L)}_{i1}, \ldots, \pi^{(L)}_{iK})$; $\quad \Lambda_{k_1k_2} \sim \mathrm{Gam}(k_{\Lambda}, \frac{1}{\theta_{\Lambda}})$; $\quad Z_{ij,k_1k_2} \sim \mathrm{Poisson}(X_{ik_1}\Lambda_{k_1k_2}X_{jk_2})$; $\quad$ and $\quad R_{ij} = \mathbf{1}(\sum_{k_1,k_2} Z_{ij,k_1k_2} > 0)$. (3)\n\nHere, the prior distribution for generating $X_{ik}$ and the likelihood based on $X_{ik}$ are both Poisson distributions. 
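The Poisson-Multinomial equivalence invoked above is easy to check empirically; the following is a quick Monte Carlo sanity check with arbitrary toy values (ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 50.0
pi = np.array([0.5, 0.3, 0.2])    # a fixed membership distribution
S = 20_000                         # number of Monte Carlo replicates

Mi = rng.poisson(M, size=S)                          # M_i ~ Poisson(M)
X = np.array([rng.multinomial(m, pi) for m in Mi])   # (X_i1, ..., X_iK) | M_i

# Marginally, X[:, k] ~ Poisson(M * pi[k]): the sample mean and sample
# variance of each column should both be close to M * pi[k].
print(np.round(X.mean(axis=0), 2), np.round(X.var(axis=0), 2))
```

Because the columns are marginally independent Poissons, each count $X_{ik}$ can be treated separately in the Gibbs updates, which is exactly what the decomposition strategy exploits.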
Consequently, we may implement posterior sampling by using Touchard polynomials [31] (details in Section 3).\nTo model binary or count data, the Ber-Poisson link function [30, 36] decomposes the latent counting vector $X_i$ into the latent integer matrix $Z_{ij}$. An appealing property of this construction is that we do not need to calculate the latent integers $\{Z_{ij,k_1k_2}\}_{k_1,k_2}$ over the $0$-valued $R_{ij}$ data, as they are equal to $0$ almost surely. Hence, the focus can be on the positive-valued relational data. This is particularly useful for real-world network data, as usually only a small fraction of the data is positive. Hence, the computational cost for inference scales only with the number of positive relational links.\nWhen nodes' feature information is not available (i.e. $F = \mathbf{0}_{N \times D}$) and $L = 1$, the SDREM reduces to the same settings as the MMSB [1]. In particular, the membership distributions of both the MMSB and the SDREM follow the same Dirichlet distribution $\{\pi_i\}_i \sim \mathrm{Dirichlet}(\alpha \mathbf{1}_{1 \times K})$. As the MMSB and its variants [22, 13, 19] introduce pairwise latent labels for all the relational data (both $1$- and $0$-valued data), they require a computational cost of $O(N^2)$ to infer all latent variables. In contrast, our novel data augmentation trick can be straightforwardly applied in these models (by simply replacing the Ber-Beta likelihood [27, 18] with the Ber-Poisson link function) to reduce their computational cost to the scale of the number of positive links. 
We show in Section 5 that we can also obtain better predictive performance with this strategy.\n\n2.4 Model summary\n\nThe full generative process of the SDREM is summarised as follows (see the visualisation in Figure 1): (1) $T_{dk} \sim \mathrm{Gam}(\gamma^{(1)}_d, \frac{1}{c^{(1)}})$, $\pi^{(1)}_i \sim \mathrm{Dirichlet}(F_iT + \alpha)$; (2) for $l = 2, \ldots, L$: $B^{(l-1)}_{i'i} \sim \mathrm{Gam}(\gamma^{(l)}_1, \frac{1}{c^{(l)}})$ if $R_{i'i} = 1$, $B^{(l-1)}_{i'i} \sim \mathrm{Gam}(\gamma^{(l)}_0, \frac{1}{c^{(l)}})$ if $i' = i$, $B^{(l-1)}_{i'i} = 0$ otherwise, and $\pi^{(l)}_i \sim \mathrm{Dirichlet}((B^{(l-1)}_{\cdot i})^{\top} \cdot \pi^{(l-1)}_{1:N})$; (3) $M_i \sim \mathrm{Poisson}(M)$, $(X_{i1}, \ldots, X_{iK}) \sim \mathrm{Multi}(M_i; \pi^{(L)}_{i1}, \ldots, \pi^{(L)}_{iK})$; (4) $\Lambda_{k_1k_2} \sim \mathrm{Gam}(k_{\Lambda}, \frac{1}{\theta_{\Lambda}})$; (5) $Z_{ij,k_1k_2} \sim \mathrm{Poisson}(X_{ik_1}\Lambda_{k_1k_2}X_{jk_2})$; (6) $R_{ij} = \mathbf{1}(\sum_{k_1,k_2} Z_{ij,k_1k_2} > 0)$; where $\{\gamma^{(1)}_d\}_d$, $\{c^{(l)}\}_l$, $\{\gamma^{(l)}_0, \gamma^{(l)}_1\}_l$, $\{k_{\Lambda}, \theta_{\Lambda}\}$, $\alpha$ and $M$ are the hyper-parameters of the model.\n\n3 Inference\n\nThe joint distribution of the relational data and all latent variables in the SDREM is:\n\n$P(\{\pi^{(l)}_i\}_{i,l}, \{B^{(l)}\}_l, \Lambda, \{Z_{ij,k_1k_2}\}_{i,j,k_1,k_2}, \{R_{ij}\}_{i,j}, \{X_{ik}\}_{i,k}, T \mid F, \gamma, c, \alpha, M, k_{\Lambda}, \theta_{\Lambda}) = [\prod_{i=1}^{N} P(\pi^{(1)}_i \mid \alpha, F_i, T)] [\prod_{l=1}^{L-1} P(B^{(l)} \mid \gamma^{(l)}, c^{(l)}) \prod_{i=1}^{N} P(\pi^{(l+1)}_i \mid \{\pi^{(l)}_{i'}\}_{i':R_{i'i}=1}, \pi^{(l)}_i, B^{(l)})] [\prod_{i,k} P(X_{ik} \mid \pi^{(L)}_{ik}, M)] P(\Lambda \mid k_{\Lambda}, \theta_{\Lambda}) [\prod_{(i,j):R_{ij}=1,\, k_1,k_2} P(Z_{ij,k_1k_2} \mid X_{ik_1}, X_{jk_2}, \Lambda_{k_1k_2})] [\prod_{d,k} P(T_{dk} \mid \gamma^{(1)}_d, c^{(1)})]. (4)\n\nBy introducing auxiliary variables, all latent variables can be sampled via efficient Gibbs sampling. This section focuses on inference for $\{X_{ik}\}_{i,k}$, which are the key variables involved in the data augmentation trick. Sampling the membership distributions $\{\pi^{(l)}_i\}_{i,l}$ is as implemented in Gamma Belief Networks [37] and Dirichlet Belief Networks [35], which mainly use a bottom-up mechanism to propagate the latent count information in each layer. As sampling the other variables is trivial, we relegate the full sampling scheme to the Supplementary Material (Appendix A).\nSampling $\{X_{ik}\}_{i,k}$: From the Poisson-Multinomial equivalence [4] we have $M_i \sim \mathrm{Poisson}(M)$, $(X_{i1}, \ldots, X_{iK}) \sim \mathrm{Multi}(M_i; \pi^{(L)}_{i1}, \ldots, \pi^{(L)}_{iK}) \; \overset{d}{=} \; X_{ik} \sim \mathrm{Poisson}(M\pi^{(L)}_{ik})$, $\forall k$. Both the prior distribution for generating $X_{ik}$ and the likelihood parametrised by $X_{ik}$ are Poisson distributions. 
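The resulting full conditional (Eq. (5)) has the form $P(X = x) \propto \lambda^x x^n / x!$, whose normaliser involves a Touchard polynomial. A minimal sketch of the inverse-CDF draw under that assumed form follows; the helper names are ours and illustrative, not from the paper's code:

```python
import numpy as np
from math import comb, exp, factorial

def stirling2(n, k):
    """Stirling number of the second kind S(n, k), via the explicit formula."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1)) // factorial(k)

def touchard(n, lam):
    """Touchard polynomial T_n(lam) = sum_k S(n, k) * lam^k."""
    return sum(stirling2(n, k) * lam ** k for k in range(n + 1))

def sample_x(lam, n, rng, cap=100):
    """Draw X with P(X = x) = lam^x * x^n / (x! * e^lam * T_n(lam)),
    by comparing a Uniform(0, 1) draw to the cumulative probabilities."""
    norm = exp(lam) * touchard(n, lam)
    u, cdf = rng.random(), 0.0
    for x in range(cap + 1):
        cdf += lam ** x * x ** n / (factorial(x) * norm)
        if u <= cdf:
            return x
    return cap  # tail mass beyond cap is numerically negligible here

rng = np.random.default_rng(3)
draws = [sample_x(1.5, 2, rng) for _ in range(2000)]
```

Here `lam` plays the role of $M\pi^{(L)}_{ik}$ times the exponential factor in Eq. (5), and `n` is the sum of the relevant latent counts $Z$; note that whenever $n > 0$ the draw is forced to be at least $1$, since $P(X = 0) \propto 0^n = 0$.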
The full conditional distribution of $X_{ik}$ (assuming $Z_{ii,\cdot\cdot} = 0$, $\forall i$) is then\n\n$P(X_{ik} \mid M, \pi, \Lambda, Z) \propto \frac{1}{X_{ik}!} [M\pi^{(L)}_{ik} e^{-\sum_{j \neq i, k_2} (\Lambda_{kk_2} + \Lambda_{k_2k}) X_{jk_2}}]^{X_{ik}} \cdot (X_{ik})^{\sum_{j_1,k_2} Z_{ij_1,kk_2} + \sum_{j_2,k_1} Z_{j_2i,k_1k}}. (5)\n\nThis follows the form of Touchard polynomials [31], for which $1 = \sum_{x=0}^{\infty} \frac{1}{e^{\lambda} T_n(\lambda)} \cdot \frac{\lambda^x x^n}{x!}$ with $T_n(\lambda) = \sum_{k=0}^{n} S(n,k) \lambda^k$, where $S(n,k)$ denotes the Stirling numbers of the second kind. A draw from (5) is then available by comparing a Uniform(0,1) random variable to the cumulative sum of the normalised probabilities.\n\n4 Related Work\n\nThere is a long history of using Bayesian methods for relational data. Usually, these models build latent representations for the nodes and use the interactions between these representations to model the relational data. Typical examples include the stochastic blockmodel [27, 26, 18] (which uses latent labels), the mixed-membership stochastic blockmodel (MMSB) [1, 22] (which uses membership distributions) and the latent feature relational model (LFRM) [25, 28] (which uses binary latent features). As most of these approaches are constructed using shallow models, their modelling capability is limited.\nThe Multiscale-MMSB [13] is a related model, which uses a nested Chinese Restaurant Process to construct hierarchical community structures. However, its tree-type structure is quite complicated and hard to implement efficiently. The Nonparametric Metadata Dependent Relational model (NMDR) [19] and the Node Attribute Relational Model (NARM) [34] also use the idea of transforming nodes' feature information to nodes' latent representations. 
However, because of their shallow latent representations, these methods are unable to describe higher-order node dependencies.\nThe hierarchical latent feature model (HLFM) [15] may be the closest model to the SDREM, as each builds a deep network architecture to model relational data. However, the HLFM uses a sigmoid belief network, and does not consider high-order node dependencies, so that each node only depends on itself through the layers. Finally, feature information enters in the last layer of the deep network architecture, and so the HLFM is unable to sufficiently describe nonlinear mappings between the feature information and the latent representation.\nRecent developments [10, 11] in Poisson matrix factorisation also try to build deep network architectures for latent structure modelling. Since these mainly use sigmoid belief networks, their propagation of binary variables is different from our real-valued distribution propagation. Information propagation through Dirichlet distributions in the SDREM follows the approaches of [37, 35]. However, their focus is on topic modelling, and no neighbourhood-wise propagation is discussed in these methods.\nOur SDREM shares a similar spirit with the Variational Graph Auto-Encoder (VGAE) [21, 24] algorithms. Both aim at combining graph convolutional networks with Bayesian relational methods. However, the VGAE has a larger computational complexity ($O(N^2)$). It uses parameterized functions to construct the deep network architecture, and its probabilistic nature occurs only in the output layer, as Gaussian random variables. In contrast, the SDREM constructs multi-stochastic-layer architectures (with Dirichlet random variables at each layer). Thus, the SDREM has better model interpretability (see Figure 5).\nWe note that recent work [33] also claims to estimate uncertainty in the graph convolutional neural network setting. 
This work uses a two-stage strategy: it first takes the observed network as a realisation from a parametric Bayesian relational model, and then uses Bayesian Neural Networks to infer the model parameters. The final result is a posterior distribution over these variables. Unlike the SDREM, this work performs the inference in two stages and also lacks inferential interpretability.\nComputational complexities: The computational complexity of the SDREM is $O(NDK + (NK + N_E)L + N_EK^2)$ and scales with the number of positive links, $N_E$. In particular, $O(NDK)$ refers to the feature information incorporation in the input layer, $O((NK + N_E)L)$ refers to the information propagation in the deep network architecture and $O(N_EK^2)$ refers to the relational data modelling in the output layer. The SDREM's computational complexity is comparable to that of the HLFM, which is $O(NDK + NKL + N_EK^2)$, and the NARM, which is $O(NDK + N_EK^2)$ [34], and is significantly less than that of MMSB-type algorithms.\n\n5 Experiments\n\nDataset information: In the following, we examine four real-world datasets: three standard citation networks (Citeseer, Cora and Pubmed [32]) and one protein-to-protein interaction network (PPI) [38]. Summary statistics for these datasets are displayed in Table 1. In the citation datasets, nodes correspond to documents and edges represent citation links. A node's features comprise the document's bag-of-words representation. In the protein-to-protein dataset, we use the pre-processed feature information provided by [12].\n\nTable 1: Dataset information. 
$N$ is the number of nodes, $N_E$ is the number of positive links, $D$ is the number of features, and F.D. = (# nonzero entries)/(# total entries) in $F$, i.e. the density of the features.\n\nDataset | N | N_E | D | F.D.\nCiteseer | 3,312 | 4,715 | 3,703 | 0.86%\nPubmed | 2,000 | 17,522 | 500 | 1.80%\nCora | 2,708 | 5,429 | 1,433 | 1.27%\nPPI | 4,000 | 105,775 | 50 | 10.20%\n\nFigure 2: Left: Mean AUC (dots) and per-iteration computing time (bar heights) comparison between the simplified SDREM and the MMSB for each dataset. Right: Mean AUC performance as a function of the number of membership distributions ($K$, with $L = 3$) and the number of layers ($L$, with $K = 20$).\n\nEvaluation criteria: We primarily focus on link prediction and use this to evaluate model performance. We use AUC (Area Under the ROC Curve) and the average negative log-likelihood on test relational data as the two comparison criteria. The AUC value represents the probability that the algorithm will rank a randomly chosen existing link higher than a randomly chosen non-existing link; hence, the higher the AUC value, the better the predictive performance. For the hyper-parameters, we specify $M \sim \mathrm{Gam}(N, 1)$ for all datasets, and $\{\gamma^{(1)}_d\}_d$, $\{\gamma^{(l)}_1, \gamma^{(l)}_0\}_l$ and $\{c^{(l)}\}_l$ are all given $\mathrm{Gam}(1,1)$ priors. Each reported criterion value is the mean of 10 replicate analyses. Each replicate uses 2000 MCMC iterations, with the first 1000 discarded as burn-in. Unless specified, reported AUC values are obtained by using 90% (per row) of the data as training data and the remaining 10% as test data. The testing relational data are not used when constructing the information propagation matrix (i.e. we set $\{B^{(l)}_{i'i}\}_l = 0$ if $R_{i'i}$ is testing data).\n\nValidating the data augmentation trick: We first evaluate the effectiveness of the data augmentation trick through comparisons with the MMSB [1]. 
To make a fair comparison, we specify the SDREM with F = 0_{N×1}, L = 1, K = 20, so that the membership distributions in each model follow the same Dirichlet distribution {π_i}_i ∼ Dirichlet(α · 1_{1×20}). Figure 2 (left panel) displays the mean AUC and per-iteration running time for the two models. It is clear that the AUC values of the simplified SDREM are always better than those of the MMSB, and that the time required for one iteration of the SDREM is substantially lower (by at least two orders of magnitude) than that of the MMSB. Note that the running time of the SDREM is highest for the PPI dataset, since it contains the largest number of positive links and the computational cost of the SDREM scales with this value.

Different settings of K and L: We evaluate the SDREM's behaviour under different architecture settings, through the influence of two parameters: K, the length of the membership distributions, and L, the number of layers. When testing the effect of different values of K we fixed L = 3, and when varying L we fixed K = 20. Figure 2 (right panel) displays the resulting mean AUC values under these settings. As might be expected, the SDREM's AUC value increases with higher model complexity (i.e. larger values of K and L). The worst performance occurs with L = 1 layer, as this has the least flexible modelling capability. Balancing computational complexity and modelling power, we set K = 20 and L = 4 for the remaining analyses in this paper.

Deep network architecture: We evaluate the advantage of using neighbourhood connections to propagate layer-wise information. Three different deep network architectures are compared: (1) Plain-SDREM. We assume the nodes' feature information is unavailable and use an identity matrix to represent the features (i.e.
F = I_{N×N}); we tried both F = 0_{N×1} and F = I_{N×N} and found the latter to perform better. (2) Fully-connected-SDREM (Full-SDREM). The propagation coefficient B_{i'i}^(l) is not restricted to be 0 when R_{i'i} = 0; instead, a hierarchical Gamma process is specified as a sparse prior on all the propagation coefficients. (3) Independent-SDREM (Inde-SDREM). This assumes each node propagates information only to itself and does not exchange information with other nodes in the deep network architecture (i.e. each {B^(l)}_l is a diagonal matrix).

Figure 3: Mean AUC (±1.96× standard errors of the mean) and negative log-likelihood (±1.96× standard errors) on 10% test data for each dataset.

Figure 4: Mean AUC and negative log-likelihood values (points) as a function of the proportion of training data (x-axis), for each dataset and deep network architecture. Vertical lines correspond to the 95% confidence interval of the reported statistics (±1.96× standard error).

Figure 3 shows the performance of each of these configurations against the non-restricted SDREM. It is clear that the non-restricted SDREM achieves the best performance in both mean AUC and negative log-likelihood among all network configurations. The Full-SDREM consistently performs the worst among all configurations.
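The structural constraints that distinguish these variants can be made concrete with a small numpy sketch of the sparsity patterns imposed on the propagation coefficients B^(l). The toy dimensions, the gamma draws, and the simple column normalisation are our own assumptions for illustration; the paper's actual hierarchical Gamma process prior is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
# Toy symmetric adjacency (training links only; held-out links stay zero).
R = np.triu(rng.integers(0, 2, size=(N, N)), 1)
R = R + R.T

raw = rng.gamma(1.0, 1.0, size=(N, N))   # unconstrained positive coefficients

# SDREM: propagate only along observed links (plus a self-loop).
B_sdrem = raw * ((R + np.eye(N)) > 0)

# Inde-SDREM: each node propagates only to itself (diagonal B).
B_inde = np.diag(np.diag(raw))

# Full-SDREM: no restriction; every pair may exchange information.
B_full = raw.copy()

# Column-normalise so each node receives a convex combination of messages
# (for the diagonal case this reduces to the identity matrix).
for B in (B_sdrem, B_inde, B_full):
    B /= B.sum(axis=0, keepdims=True)
```

The Full-SDREM variant corresponds to B_full, which places no structural zeros at all; the SDREM's mask ties the propagation pattern to the observed network, which is what keeps its cost proportional to the number of positive links.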
This suggests that the fully connected architecture is a poor candidate, and that the sampler may easily become trapped in local modes.

Performance in the presence of feature information: We compare the SDREM with several alternative Bayesian methods for relational data and one graph convolutional network model. We examine: the Hierarchical Latent Feature Relational Model (HLFM) [15], the Node Attribute Relational Model (NARM) [34], the Hierarchical Gamma Process-Edge Partition Model (HGP-EPM) [36] and a graph convolutional neural network (GCN) [20]. The NARM, HGP-EPM and GCN methods are executed using their respective authors' implementations, under their default settings. The HLFM is implemented to the best of our abilities, and we set the same number of layers and length of latent binary representation as for the SDREM. For the GCN, the AUC value is calculated from the pairwise similarities between the node representations and the ground-truth relational data; the negative log-likelihood is unavailable due to its frequentist setting.

Figure 4 shows the performance of each method on the four datasets, under different ratios of training data (x-axis). In terms of AUC, the SDREM performs best among all the methods when the proportion of training data is larger than 0.5. However, the performance of the SDREM is not outstanding when the training data ratio is less than 0.5. This may partly be due to there being insufficient relational data to effectively model the latent counts. Since the SDREM and the HLFM are the two best-performing algorithms in most cases, this confirms the effectiveness of utilising a deep network architecture.
Similar conclusions can be drawn from the negative log-likelihood: the SDREM and the HLFM are the two best-performing algorithms.

Figure 5: Left: visualizations of the membership distributions ({π_{1:50}^(l)}_{l=1}^{3}) and the normalized auxiliary counting variable (X̄_{1:50}) for the first 50 nodes of the Citeseer dataset (rows represent the nodes and columns represent the latent features); right: visualizations of the non-zero positions (R + I) and the transition coefficient matrices ({β^(l)}_{l=1}^{2}) for the first 200 nodes of the Citeseer dataset.

Table 2: Average latent counts (per node) in different layers.

Dataset    Layer 3  Layer 2  Layer 1
Citeseer   2.5      7.8      533.7
Cora       2.3      7.0      290.1
Pubmed     10.1     24.8     292.4
PPI        12.7     20.1     65.6

Comparison with Variational Graph Auto-Encoder We also make brief comparisons with the Variational Graph Auto-Encoder (VGAE) [21]. Taking 90% of the data as training data and the remainder as test data, the average AUC scores of 16 random VGAE runs for these datasets are: Citeseer (0.863), Cora (0.854), Pubmed (0.921) and PPI (0.934).
Considering the attributes of these datasets, we find that the VGAE obtains better performance than our SDREM on the datasets with sparse linkage, whereas the two methods are competitive on the other types of datasets. This phenomenon may have two causes: (1) due to the nature of the inference (backward propagation of latent counts and forward variable sampling), our SDREM propagates less counting information to the higher layers (see Table 2), so the deep hierarchical structure might be less powerful in sparse networks; (2) the Sigmoid and ReLU activation functions might be more flexible than the Dirichlet distribution in the case of sparse networks. We will continue to investigate this issue in future work.

Latent structure visualization: We also visualize the latent structures of the model in Figure 5 to gain further insight. From the left panel, we can see that the membership distributions gradually become more distinguishable across the layers. The less distinguishable membership distributions might be due to two reasons: (1) the higher abstraction of the latent features; (2) insufficient latent counting information being back-propagated to these higher layers. The normalized latent counting vector (X̄) appears to be essentially identical to the output membership distribution π^(3). This verifies that our introduction of X̄ seems to successfully pass the information to the latent integer variables Z. From the right panel, showing the information propagation matrices, we can see that the neighbourhood-wise information appears to become weaker from the input layer to the output layer.

6 Conclusion

We have introduced a Bayesian framework that uses deep latent representations of nodes to model relational data. Through efficient neighbourhood-wise information propagation in the deep network architecture and a novel data augmentation trick, the proposed SDREM is a promising approach for modelling scalable networks.
As the SDREM can provide variability estimates for its latent variables and predictions, it has the potential to be a competitive alternative to frequentist graph convolutional network-type algorithms. The promising experimental results validate the effectiveness of the SDREM's deep network architecture and its competitive performance against other approaches. Since the SDREM is the first work to use neighbourhood-wise information propagation in Bayesian methods, combining this with other Bayesian relational models and other applications with pairwise data (e.g. collaborative filtering) would be interesting future work.

Acknowledgements

Xuhui Fan and Scott A. Sisson are supported by the Australian Research Council through the Australian Centre of Excellence in Mathematical and Statistical Frontiers (ACEMS, CE140100049), and Scott A. Sisson through the Discovery Project Scheme (DP160102544). Bin Li is supported by the Shanghai Municipal Science & Technology Commission (16JC1420401) and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.

References

[1] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. In NIPS, pages 33-40, 2009.

[2] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In NIPS, pages 1993-2001, 2016.

[3] Hanjun Dai, Zornitsa Kozareva, Bo Dai, Alex Smola, and Le Song. Learning steady-states of iterative algorithms over graphs. In ICML, pages 1114-1122, 2018.

[4] David B. Dunson and Amy H. Herring. Bayesian latent variable models for mixed discrete outcomes. Biostatistics, 6(1):11-25, 2005.

[5] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, pages 2224-2232, 2015.

[6] Xuhui Fan, Bin Li, and Scott Sisson. Rectangular bounding process. In NeurIPS, pages 7631-7641, 2018.

[7] Xuhui Fan, Bin Li, and Scott Sisson. Binary space partitioning forests. In AISTATS, volume 89 of Proceedings of Machine Learning Research, 2019.

[8] Xuhui Fan, Bin Li, and Scott A. Sisson. The binary space partitioning-tree process. In AISTATS, volume 84 of Proceedings of Machine Learning Research, pages 1859-1867, 2018.

[9] Xuhui Fan, Bin Li, Yi Wang, Yang Wang, and Fang Chen. The Ostomachion Process. In AAAI Conference on Artificial Intelligence, pages 1547-1553, 2016.

[10] Zhe Gan, Ricardo Henao, David Carlson, and Lawrence Carin. Learning deep sigmoid belief networks with data augmentation. In AISTATS, pages 268-276, 2015.

[11] Zhe Gan, Chunyuan Li, Ricardo Henao, David E. Carlson, and Lawrence Carin. Deep temporal sigmoid belief networks for sequence modeling. In NIPS, pages 2467-2475, 2015.

[12] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1024-1034, 2017.

[13] Qirong Ho, Ankur P. Parikh, and Eric P. Xing. A multiscale community blockmodel for network exploration. Journal of the American Statistical Association, 107(499):916-934, 2012.

[14] Changwei Hu, Piyush Rai, and Lawrence Carin. Non-negative matrix factorization for discrete data with hierarchical side-information. In AISTATS, pages 1124-1132, 2016.

[15] Changwei Hu, Piyush Rai, and Lawrence Carin. Deep generative models for relational data with side information. In ICML, pages 1578-1586, 2017.

[16] Ilkka Huopaniemi, Tommi Suvitaival, Janne Nikkilä, Matej Orešič, and Samuel Kaski. Multivariate multi-way analysis of multi-source data.
Bioinformatics, 26(12):i391-i398, 2010.

[17] Brian Karrer and Mark E. J. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.

[18] Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In AAAI, pages 381-388, 2006.

[19] Dae Il Kim, Michael Hughes, and Erik Sudderth. The nonparametric metadata dependent relational model. In ICML, pages 1559-1566, 2012.

[20] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[21] Thomas N. Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.

[22] Phaedon-Stelios Koutsourelakis and Tina Eliassi-Rad. Finding mixed-memberships in social networks. In AAAI, 2008.

[23] Bin Li, Qiang Yang, and Xiangyang Xue. Transfer learning for collaborative filtering via a rating-matrix generative model. In ICML, pages 617-624, 2009.

[24] Nikhil Mehta, Lawrence Carin, and Piyush Rai. Stochastic blockmodels meet graph neural networks. arXiv preprint arXiv:1905.05738, 2019.

[25] Kurt Miller, Michael I. Jordan, and Thomas L. Griffiths. Nonparametric latent feature models for link prediction. In NIPS, pages 1276-1284, 2009.

[26] Mark E. J. Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

[27] Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic block structures. Journal of the American Statistical Association, 96(455):1077-1087, 2001.

[28] Konstantina Palla, David A. Knowles, and Zoubin Ghahramani. An infinite latent attribute model for network data. In ICML, 2012.

[29] Ian Porteous, Evgeniy Bart, and Max Welling. Multi-HDP: A non-parametric Bayesian model for tensor factorization. In AAAI, pages 1487-1490, 2008.

[30] Piyush Rai, Changwei Hu, Ricardo Henao, and Lawrence Carin. Large-scale Bayesian multi-label learning via topic-based label embeddings. In NIPS, pages 3222-3230, 2015.

[31] Steven M. Roman and Gian-Carlo Rota. The umbral calculus. Advances in Mathematics, 27(2):95-188, 1978.

[32] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.

[33] Yingxue Zhang, Soumyasundar Pal, Mark Coates, and Deniz Üstebay. Bayesian graph convolutional neural networks for semi-supervised classification. arXiv preprint arXiv:1811.11103, 2018.

[34] He Zhao, Lan Du, and Wray Buntine. Leveraging node attributes for incomplete relational data. In ICML, pages 4072-4081, 2017.

[35] He Zhao, Lan Du, Wray Buntine, and Mingyuan Zhou. Dirichlet belief networks for topic structure learning. In NeurIPS, pages 7966-7977, 2018.

[36] Mingyuan Zhou. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, pages 1135-1143, 2015.

[37] Mingyuan Zhou, Yulai Cong, and Bo Chen. Augmentable gamma belief networks. Journal of Machine Learning Research, 17(163):1-44, 2016.

[38] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190-i198, 2017.