{"title": "Random function priors for exchangeable arrays with applications to graphs and relational data", "book": "Advances in Neural Information Processing Systems", "page_first": 998, "page_last": 1006, "abstract": null, "full_text": "Random function priors for exchangeable arrays with\n\napplications to graphs and relational data\n\nJames Robert Lloyd\n\nDepartment of Engineering\nUniversity of Cambridge\n\nPeter Orbanz\n\nDepartment of Statistics\n\nColumbia University\n\nZoubin Ghahramani\n\nDepartment of Engineering\nUniversity of Cambridge\n\nDaniel M. Roy\n\nDepartment of Engineering\nUniversity of Cambridge\n\nAbstract\n\nA fundamental problem in the analysis of structured relational data like graphs,\nnetworks, databases, and matrices is to extract a summary of the common struc-\nture underlying relations between individual entities. Relational data are typically\nencoded in the form of arrays; invariance to the ordering of rows and columns\ncorresponds to exchangeable arrays. Results in probability theory due to Aldous,\nHoover and Kallenberg show that exchangeable arrays can be represented in terms\nof a random measurable function which constitutes the natural model parameter in\na Bayesian model. We obtain a \ufb02exible yet simple Bayesian nonparametric model\nby placing a Gaussian process prior on the parameter function. Ef\ufb01cient inference\nutilises elliptical slice sampling combined with a random sparse approximation\nto the Gaussian process. We demonstrate applications of the model to network\ndata and clarify its relation to models in the literature, several of which emerge as\nspecial cases.\n\n1\n\nIntroduction\n\nStructured relational data arises in a variety of contexts, including graph-valued data [e.g. 1, 5],\nmicro-array data, tensor data [e.g. 27] and collaborative \ufb01ltering [e.g. 21]. This data is typi\ufb01ed by\nexpressing relations between 2 or more objects (e.g. friendship between a pair of users in a social\nnetwork). 
Pairwise relations can be represented by a 2-dimensional array (a matrix); more generally, relations between d-tuples are recorded as d-dimensional arrays (d-arrays). We consider Bayesian models of infinite 2-arrays (Xij)i,j∈N, where entries Xij take values in a space X. Each entry Xij describes the relation between objects i and j. Finite samples—relational measurements for n objects—are n × n-arrays. As the sample size increases, the data aggregates into a larger and larger array. Graph-valued data, for example, corresponds to the case X = {0, 1}. In collaborative filtering problems, the set of objects is subdivided into two disjoint sets, e.g., users and items.

Latent variable models for such data explain observations by means of an underlying structure or summary, such as a low-rank approximation to an observed array or an embedding into a Euclidean space. This structure is formalized as a latent (unobserved) variable. Examples include matrix factorization [e.g. 4, 21], non-linear generalisations [e.g. 12, 27, 28], block modelling [e.g. 1, 10], latent distance modelling [e.g. 5] and many others [e.g. 14, 17, 20].

Hoff [4] first noted that a number of parametric latent variable models for relational data are exchangeable—an applicable assumption whenever the objects in the data have no natural ordering, e.g., users in a social network or products in ratings data—and can be cast into the common functional form guaranteed to exist by results in probability theory. Building on this connection,

Figure 1: Left: The distribution of any exchangeable random graph with vertex set N and edges E = (Xij)i,j∈N can be characterised by a random function Θ : [0, 1]² → [0, 1]. 
Given \u0398, a graph\ncan be sampled by generating a uniform random variable Ui for each vertex i, and sampling edges as\nXij \u223c Bernoulli(\u0398(Ui, Uj)). Middle: A heat map of an example function \u0398. Right: A 100 \u00d7 100\nsymmetric adjacency matrix sampled from \u0398. Only unordered index pairs Xij are sampled in the\nsymmetric case. Rows and columns have been ordered by increasing value of Ui, rather than i.\n\nwe consider nonparametric models for graphs and arrays. Results of Aldous [2], Hoover [6] and\nKallenberg [7] show that random arrays that satisfy an exchangeability property can be represented\nin terms of a random function. These representations have been further developed in discrete analy-\nsis for the special case of graphs [13]; this case is illustrated in Fig. 1. The results can be regarded as\na generalization of de Finetti\u2019s theorem to array-valued data. Their implication for Bayesian model-\ning is that we can specify a prior for an exchangeable random array model by specifying a prior on\n(measurable) functions. The prior is a distribution on the space of all functions that can arise in the\nrepresentation result, and the dimension of this space is in\ufb01nite. A prior must therefore be nonpara-\nmetric to have reasonably large support since a parametric prior concentrates on a \ufb01nite-dimensional\nsubset. In the following, we model the representing function explicitly using a nonparametric prior.\n\n2 Background: Exchangeable graphs and arrays\n\nA fundamental component of every Bayesian model is a random variable \u0398, the parameter of the\nmodel, which decouples the data. De Finetti\u2019s theorem [9] characterizes this parameter for random\nsequences: Let X1, X2, . . . be an in\ufb01nite sequence of random variables, each taking values in a\ncommon space X . A sequence is called exchangeable if its joint distribution is invariant under\narbitrary permutation of the indices, i.e., if\n\n(X1, X2, . . .) 
d= (Xπ(1), Xπ(2), . . .)   for all π ∈ S∞ .   (2.1)

Here, d= denotes equality in distribution, and S∞ is the set of all permutations of N that permute a finite number of elements. De Finetti's theorem states that (Xi)i∈N is exchangeable if and only if there exists a random probability measure Θ on X such that X1, X2, . . . | Θ ∼iid Θ, i.e., conditioned on Θ, the observations are independent and Θ-distributed. From a statistical perspective, Θ represents common structure in the observed data—and thus a natural target of statistical inference—whereas P[Xi | Θ] captures the remaining, independent randomness in each observation.

2.1 De Finetti-type representations for random arrays

To specify Bayesian models for graph- or array-valued data, we need a suitable counterpart to de Finetti's theorem that is applicable when the random sequences in (2.1) are substituted by random arrays X = (Xij)i,j∈N. For such data, the invariance assumption (2.1) applied to all elements of X is typically too restrictive: in the graph case Xij ∈ {0, 1}, for example, the probability of X would then depend only on the proportion of edges present in the graph, but not on the graph structure. Instead, we define exchangeability of random 2-arrays in terms of the simultaneous application of a permutation to rows and columns. More precisely:

Definition 2.1. An array X = (Xij)i,j∈N is called an exchangeable array if

(Xij) d= (Xπ(i)π(j))   for every π ∈ S∞ .   (2.2)

Since this weakens the hypothesis (2.1) by demanding invariance only under a subset of all permutations of N²—those of the form (i, j) ↦ (π(i), π(j))—we can no longer expect de Finetti's theorem to hold. The relevant generalization of de Finetti's theorem to this case is the following:

Theorem 2.2 (Aldous, Hoover). 
A random 2-array (Xij) is exchangeable if and only if there is a random (measurable) function F : [0, 1]³ → X such that

(Xij) d= (F(Ui, Uj, Uij))   (2.3)

for every collection (Ui)i∈N and (Uij)i≤j∈N of i.i.d. Uniform[0, 1] random variables, where Uji = Uij for j < i ∈ N.

2.2 Random graphs

The graph-valued data case X = {0, 1} is of particular interest. Here, the array X, interpreted as an adjacency matrix, specifies a random graph with vertex set N. For undirected graphs, X is symmetric. We call a random graph exchangeable if X satisfies (2.2).

For undirected graphs, the representation (2.3) simplifies further: there is a random function Θ : [0, 1]² → [0, 1], symmetric in its arguments, such that

F(Ui, Uj, Uij) := 1 if Uij < Θ(Ui, Uj), and 0 otherwise   (2.4)

satisfies (2.3). Each variable Ui is associated with a vertex, each variable Uij with an edge. The representation (2.4) is equivalent to the sampling scheme

U1, U2, . . . ∼iid Uniform[0, 1]   and   Xij = Xji ∼ Bernoulli(Θ(Ui, Uj)) ,   (2.5)

which is illustrated in Fig. 1.

Recent work in discrete analysis shows that any symmetric measurable function [0, 1]² → [0, 1] can be regarded as a (suitably defined) limit of adjacency matrices of graphs of increasing size [13]—intuitively speaking, as the number of rows and columns increases, the array in Fig. 1 (right) converges to the heat map in Fig. 1 (middle) (up to a reordering of rows and columns).

2.3 The general case: d-arrays

Theorem 2.2 can in fact be stated in a more general setting than 2-arrays, namely for random d-arrays, which are collections of random variables of the form (Xi1...id)i1,...,id∈N. Thus, a sequence is a 1-array, a matrix a 2-array. 
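Before moving to general d, the 2-array sampling scheme (2.5) is straightforward to simulate. The sketch below uses an illustrative graphon Θ(u, v) = uv (our choice for the example, not one used in the paper) and only the Python standard library:

```python
import random

def sample_graph(n, theta, seed=0):
    """Sample an n-vertex undirected graph via the scheme (2.5):
    draw U_i ~ Uniform[0,1] per vertex, then X_ij ~ Bernoulli(theta(U_i, U_j))."""
    rng = random.Random(seed)
    U = [rng.random() for _ in range(n)]
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # only unordered pairs; the graph is symmetric
            X[i][j] = X[j][i] = int(rng.random() < theta(U[i], U[j]))
    return U, X

# Illustrative graphon: vertices with large U_i act as hubs.
theta = lambda u, v: u * v
U, X = sample_graph(100, theta)
```

Reordering the vertices by increasing U_i reproduces the effect shown in Fig. 1 (right): the sampled adjacency matrix comes to resemble a pixelated version of Θ.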
A d-array can be interpreted as an encoding of a relation between d-tuples. In this general case, an analogous theorem holds, but the random function F in (2.3) is in general more complex: in addition to the collections U{i} and U{ij} of uniform variables, the representation requires an additional collection UI of uniform variables for every non-empty subset I ⊆ {1, . . . , d}; e.g., U{i1,i3,i4} for d ≥ 4 and I = {1, 3, 4}. The representation (2.3) is then substituted by

F : [0, 1]^(2^d − 1) → X   and   (Xi1,...,id) d= (F(UI1, . . . , UI(2^d − 1))) .   (2.6)

For d = 1, we recover a version of de Finetti's theorem. For a discussion of convergence properties of general arrays similar to those sketched above for random graphs, see [3].

Because we do not explicitly consider the case d > 2 in our experiments, we restrict our presentation of the model to the 2-array-valued case for simplicity. We note, however, that the model and inference algorithms described in the following extend immediately to general d-array-valued data.

3 Model

To define a Bayesian model for exchangeable graphs or arrays, we start with Theorem 2.2: a distribution on exchangeable arrays can be specified by a distribution on measurable functions [0, 1]³ → X. We decompose the function F into two functions Θ : [0, 1]² → W and H : [0, 1] × W → X for a suitable space W, such that

(Xij) d= (F(Ui, Uj, Uij)) = (H(Uij, Θ(Ui, Uj))) .   (3.1)

Such a decomposition always exists—trivially, choose W = [0, 1]². The decomposition introduces a natural hierarchical structure. We initially sample a random function Θ—the model parameter in terms of Bayesian statistics—which captures the structure of the underlying graph or array. The (Ui) then represent attributes of nodes or objects, and H and the array (Uij) model the remaining noise in the observed relations.

Model definition. 
For the purpose of defining a Bayesian model, we will model Θ as a continuous function with a Gaussian process prior. More precisely, we take W = R and consider a zero-mean Gaussian process prior on CW := C([0, 1]², W), the space of continuous functions from [0, 1]² to W, with kernel function κ : [0, 1]² × [0, 1]² → W. The full generative model is then:

Θ ∼ GP(0, κ)
U1, U2, . . . ∼iid Uniform[0, 1]
Xij | Wij ∼ P[ · | Wij]   where Wij = Θ(Ui, Uj) .   (3.2)

The parameter space of our model is the infinite-dimensional space CW. Hence, the model is nonparametric.

Graphs and real-valued arrays require different choices of P. In either case, the model first generates the latent array W = (Wij). Observations are then generated as follows:

Observed data   Sample space   P[Xij ∈ · | Wij]
Graph           X = {0, 1}     Bernoulli(φ(Wij))
Real array      X = R          Normal(Wij, σX²)

where φ is the logistic function, and σX² is a noise variance parameter.

The Gaussian process prior favors smooth functions, which will in general result in more interpretable latent space embeddings. Inference in Gaussian processes is a well-understood problem, and the choice of a Gaussian prior allows us to leverage the full range of inference methods available for these models.

Discussion of modeling assumptions. In addition to exchangeability, our model assumes (i) that the function Θ is continuous—which implies measurability as in Theorem 2.2 but is a stronger requirement—and (ii) that its law is Gaussian. Exchangeable, undirected graphs are always representable using a Bernoulli distribution for P[Xij ∈ · | Wij]. Hence, in this case, (i) and (ii) are indeed the only assumptions imposed by the model. 
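To make the generative model (3.2) concrete in the graph case, here is a minimal numerical sketch. It assumes numpy, a plain RBF kernel on the latent pairs (a generic stand-in, not the symmetrised kernel of Section 5), and a small jitter term playing the role of a noise component to keep the Cholesky factorisation stable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
U = rng.uniform(size=n)                           # one latent coordinate per node (r = 1)

# Inputs to Theta are all ordered pairs (U_i, U_j).
P = np.array([(ui, uj) for ui in U for uj in U])  # n^2 x 2 inputs
sq = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * 0.3 ** 2)) + 1e-4 * np.eye(n * n)   # RBF gram matrix + jitter

# One draw of Theta evaluated at the pairs, via a Cholesky factor of K.
W = (np.linalg.cholesky(K) @ rng.standard_normal(n * n)).reshape(n, n)
W = (W + W.T) / 2                                 # enforce Theta(u, v) = Theta(v, u)

phi = 1.0 / (1.0 + np.exp(-W))                    # logistic link
X = (rng.uniform(size=(n, n)) < phi).astype(int)
X = np.triu(X, 1)                                 # keep unordered pairs only
X = X + X.T
```

This is a sketch of the prior predictive only; the symmetrisation-by-averaging and the jitter value are our simplifications for the example.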
In the case of real-valued matrices, the model additionally assumes that the function H in (3.1) is of the form

H(Uij, Θ(Ui, Uj)) d= Θ(Ui, Uj) + εij   where   εij ∼iid Normal(0, σ) .   (3.3)

Another rather subtle assumption arises implicitly when the array X is not symmetric, i.e., not guaranteed to satisfy Xij = Xji, for example, if X is a directed graph: in Theorem 2.2, the array (Uij) is symmetric even if X is not. The randomness in Uij accounts for both Xij and Xji, which means the conditional variables Xij | Wij and Xji | Wji are dependent, and a precise representation would have to sample (Xij, Xji) | Wij, Wji jointly, a fact our model neglects in (3.2). However, it can be shown that any exchangeable array can be arbitrarily well approximated by arrays which treat Xij | Wij and Xji | Wji as independent [8, Thm. 2].

Remark 3.1 (Dense vs. sparse data). The methods described here address random arrays that are dense, i.e., as the size of an n × n array increases, the number of non-zero entries grows as O(n²). Network data is typically sparse, with O(n) non-zero entries. Density is an immediate consequence of Theorem 2.2: for graph data the asymptotic proportion of present edges is p := ∫ Θ(x, y) dx dy, and the graph is hence either empty (for p = 0) or dense (since O(pn²) = O(n²)). Analogous representation theorems for sparse random graphs are to date an open problem in probability.

4 Related work

Our model has some noteworthy relations to the Gaussian process latent variable model (GPLVM), a dimensionality-reduction technique [e.g. 11]. 
GPLVMs can be applied to 2-arrays, but doing so makes the assumption that either the rows or the columns of the random array are independent [12]. In terms of our model, this corresponds to choosing kernels of the form κU ⊗ δ, where ⊗ represents a tensor product¹ and δ represents an 'identity' kernel (i.e., the corresponding kernel matrix is the identity matrix). From this perspective, the application of our model to exchangeable real-valued arrays can be interpreted as a form of co-dimensionality reduction.

For graph data, a related parametric model is the eigenmodel of Hoff [4]. This model, also justified by exchangeability arguments, approximates an array with a bilinear form, followed by some link function and conditional probability distribution.

Available nonparametric models include the infinite relational model (IRM) [10], latent feature relational model (LFRM) [14], infinite latent attribute model (ILA) [17] and many others. A recent development is the sparse matrix-variate Gaussian process blockmodel (SMGB) of Yan et al. [28]. Although not motivated in terms of exchangeability, this model does not impose independence assumptions on either rows or columns, in contrast to the GPLVM. The model uses kernels of the form κ1 ⊗ κ2; our work suggests that it may not be necessary to impose tensor product structure, which allows for inference with improved scaling. Roy and Teh [20] present a nonparametric Bayesian model of relational data that approximates Θ by a piece-wise constant function with a specific hierarchical structure, which is called a Mondrian process in [20].

Some examples of the various available models can be succinctly summarized as follows:

Graph data
Random function model:        Θ ∼ GP(0, κ)
Latent class [26]:            Wij = mUiUj where Ui ∈ {1, . . . , K}
IRM [10]:                     Wij = mUiUj where Ui ∈ {1, . . . , ∞}
Latent distance [5]:          Wij = −|Ui − Uj|
Eigenmodel [4]:               Wij = Ui′ΛUj
LFRM [14]:                    Wij = Ui′ΛUj where Ui ∈ {0, 1}∞
ILA [17]:                     Wij = Σd IUid IUjd Λ(d)UidUjd where Ui ∈ {0, . . . , ∞}∞
SMGB [28]:                    Θ ∼ GP(0, κ1 ⊗ κ2)

Real-valued array data
Random function model:        Θ ∼ GP(0, κ)
Mondrian process based [20]:  Θ = piece-wise constant random function
PMF [21]:                     Wij = Ui′Vj
GPLVM [12]:                   Θ ∼ GP(0, κ ⊗ δ)

5 Posterior computation

We describe Markov Chain Monte Carlo (MCMC) algorithms for generating approximate samples from the posterior distribution of the model parameters given a partially observed array. Most importantly, we describe a random subset-of-regressors approximation that scales to graphs with hundreds of nodes. Given the relatively straightforward nature of the proposed algorithms and approximations, we refer the reader to other papers whenever appropriate.

5.1 Latent space and kernel

Theorem 2.2 is not restricted to the use of uniform distributions for the variables Ui and Uij. The proof remains unchanged if one replaces the uniform distributions with any non-atomic probability measure on a Borel space. For the purposes of inference, normal distributions are more convenient, and we henceforth use U1, U2, . . . ∼iid N(0, Ir) for some integer r.

Since we focus on undirected graphical data, we require the symmetry condition Wij = Wji. 
This can be achieved by constructing the kernel function in the following way:

κ(ξ1, ξ2) = ½ (κ̄(ξ1, ξ2) + κ̄(ξ1, ξ̄2)) + σ²I   (Symmetry + noise)   (5.1)
κ̄(ξ1, ξ2) = s² exp(−|ξ1 − ξ2|² / (2ℓ²))   (RBF kernel)   (5.2)

where ξk = (Uik, Ujk), ξ̄k = (Ujk, Uik) and s, ℓ, σ represent a scale factor, length scale and noise respectively (see [e.g. 19] for a discussion of kernel functions). We collectively denote the kernel parameters by ψ.

¹We define the tensor product of kernel functions as follows: (κU ⊗ κV)((u1, v1), (u2, v2)) = κU(u1, u2) × κV(v1, v2).

5.2 Sampling without approximating the model

In the simpler case of a real-valued array X, we construct an MCMC algorithm over the variables (U, ψ, σX) by repeatedly slice sampling [16] from the conditional distributions

Uj | U−j, ψ, σX, X ,   ψi | ψ−i, σX, U, X   and   σX | ψ, U, X ,   (5.3)

where σX is the noise variance parameter used when modelling real-valued data, introduced in Section 3. Let N = |U{i}| denote the number of rows in the observed array, let ξ be the set of all pairs (Ui, Uj) for all observed relations Xij, let O = |ξ| denote the number of observed relations, and let K represent the O × O kernel matrix between all points in ξ. Changes to ψ affect every entry in the kernel matrix K and so, naively, the computation of the Gaussian likelihood of X takes O(O³) time. The cubic dependence on O seems unavoidable, and thus this naive algorithm is unusable for all but small data sets.

5.3 A random subset-of-regressors approximation

To scale the method to larger graphs, we apply a variation of a method known as Subsets-of-Regressors (SoR) [22, 23, 25]. 
(See [18] for an excellent survey of this and other sparse approximations.) The SoR approximation replaces the infinite-dimensional GP with a finite-dimensional approximation. Our approach is to treat both the inputs and outputs of the GP as latent variables. In particular, we introduce k Gaussian distributed pseudoinputs η = (η1, . . . , ηk) and define target values Tj = Θ(ηj). Writing Kηη for the kernel matrix formed from the pseudoinputs η, we have

(ηi) ∼iid N(0, I2r)   and   T | η ∼ N(0, Kηη) .   (5.4)

The idea of the SoR approximation is to replace Wij with the posterior mean conditioned on (η, T),

W = Kξη Kηη⁻¹ T ,   (5.5)

where Kξη is the kernel matrix between the latent embeddings ξ and the pseudoinputs η. By considering random pseudoinputs, we construct an MCMC analogue of the techniques proposed in [24]. The conditional distribution T | U, η, ψ, (σX), X is amenable to elliptical slice sampling [15]. All other random parameters, including the (Ui), can again be sampled from their full conditional distributions using slice sampling. The sampling algorithms require that one computes expressions involving (5.5). As a result they cost at most O(k³O) time.

6 Experiments

We evaluate the model on three different network data sets. Two of these data sets—the high school and NIPS co-authorship data—have been extensively analyzed in the literature. The third data set, a protein interactome, was previously noted by Hoff [4] to be of interest since it exhibits both block structure and transitivity.

Data set      Recorded data                                      Vertices   Reference
High school   high school social network                         90         e.g. [4]
NIPS          densely connected subset of coauthorship network   234        e.g. [14]
Protein       protein interactome                                230        e.g. [4]

We compare performance of our model on these data sets to three other models: probabilistic matrix factorization (PMF) [21], Hoff's eigenmodel, and the GPLVM (see also Sec. 4). The models are chosen for comparability, since they all embed nodes into a Euclidean latent space. Experiments for all three models were performed using reference implementations by the respective authors.²

²Implementations are available for PMF at http://www.mit.edu/~rsalakhu/software.html; for the eigenmodel at http://cran.r-project.org/src/contrib/Descriptions/eigenmodel.html; and for the GPLVM at http://www.cs.man.ac.uk/~neill/collab/ .

Figure 2: Protein interactome data. Left: Interactome network. Middle: Sorted adjacency matrix. The network exhibits stochastic equivalence (visible as block structure in the matrix) and homophily (concentration of points around the diagonal). Right: Maximum a posteriori estimate of the function Θ, corresponding to the function in Fig. 1 (middle).

Model                   Method                Iterations [burn-in]   Algorithm parameters
PMF [21]                stochastic gradient   1000                   author defaults
Eigenmodel [4]          MCMC                  10000 [250]            author defaults
GPLVM [12]              stochastic gradient   20 sweeps              author defaults
Random function model   MCMC                  1000 [200]             (see below)

We use standard normal priors on the latent variables U and pseudo points η, and log normal priors for kernel parameters. Parameters are chosen to favor slice sampling acceptance after a reasonable number of iterations, as evaluated over a range of data sets, and are summarized in the table on the right. [Table: prior log means and std widths for U, η, length scale, scale factor and target noise; the cell values are not recoverable from the extraction.] Balancing computational demands, we sampled T 50 times per iteration whilst all other variables were sampled once per iteration.

We performed 5-fold cross validation, predicting links in a held out partition given 4 others. 
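This evaluation protocol can be sketched as follows; the fold construction and the rank-based AUC estimator below are generic stand-ins written for illustration (standard library only), not the evaluation code used for the experiments:

```python
import random

def auc(scores_pos, scores_neg):
    """Rank-based AUC: the probability that a positive example outranks a
    negative one, counting ties as one half."""
    wins = sum((p > q) + 0.5 * (p == q) for p in scores_pos for q in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def five_fold(pairs, seed=0):
    """Partition the observed index pairs into 5 folds; each fold is then
    held out once while the model is trained on the remaining 4."""
    pairs = pairs[:]
    random.Random(seed).shuffle(pairs)
    return [pairs[k::5] for k in range(5)]

# Tiny usage example on the unordered index pairs of a 10-node graph.
pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)]
folds = five_fold(pairs)
assert sum(len(f) for f in folds) == len(pairs)
```

In each round, the predicted edge probabilities for the held-out fold would be split into positives (observed edges) and negatives (non-edges) and passed to `auc`.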
Where the models did not restrict their outputs to values between 0 and 1, we truncated any predictions lying outside this range. The following table reports average AUC (area under the receiver operating characteristic) for the various models, with numbers for the top performing model set in bold. Significance of results is evaluated by means of a t-test with a p-value of 0.05; results for models not distinguishable from the top performing model in terms of this t-test are also set in bold.

AUC results

Data set            High school          NIPS                 Protein
Latent dimensions   1      2      3      1      2      3      1      2      3
PMF                 0.747  0.792  0.792  0.729  0.789  0.820  0.787  0.810  0.841
Eigenmodel          0.742  0.806  0.806  0.789  0.818  0.845  0.805  0.866  0.882
GPLVM               0.744  0.775  0.782  0.888  0.876  0.883  0.877  0.883  0.873
RFM                 0.815  0.827  0.820  0.907  0.914  0.919  0.903  0.910  0.912

The random function model outperforms the other models in all tests. We also note that in all experiments, a single latent dimension suffices to achieve better performance, even when the other models use additional latent dimensions.

The posterior distribution of Θ favors functions defining random array distributions that explain the data well. In this sense, our model fits a probability distribution. 
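Concretely, such a probabilistic fit can be scored on held-out edges through their conditional Bernoulli likelihood. The helper below (our naming, written for illustration) computes the scaled negative log conditional edge probability defined in footnote 3, assuming the link probabilities φ(Wij) have already been computed:

```python
import math

def neg_log_edge_prob(probs_and_labels):
    """Total negative log conditional edge probability over held-out edges,
    scaled as in footnote 3: -log P(X_ij | W_ij) x 1000 / (number of edges)."""
    total = 0.0
    for p, x in probs_and_labels:      # p = phi(W_ij), x in {0, 1}
        total += -math.log(p if x == 1 else 1.0 - p)
    return total * 1000 / len(probs_and_labels)

# Usage: two held-out entries, one present edge scored 0.9, one absent scored 0.2.
score = neg_log_edge_prob([(0.9, 1), (0.2, 0)])
```

Lower scores indicate that the model assigns higher probability to the held-out observations.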
The standard inference methods for GPLVM and PMF applied to relational data, in contrast, are designed to fit mean squared error, and should therefore be expected to show stronger performance under a mean squared error metric. As the following table shows, this is indeed the case.

RMSE results

Data set            High school          NIPS                 Protein
Latent dimensions   1      2      3      1      2      3      1      2      3
PMF                 0.245  0.242  0.240  0.141  0.135  0.130  0.151  0.142  0.139
Eigenmodel          0.244  0.238  0.236  0.141  0.132  0.124  0.149  0.142  0.138
GPLVM               0.244  0.241  0.239  0.112  0.109  0.106  0.139  0.137  0.138
RFM                 0.239  0.234  0.235  0.114  0.111  0.110  0.138  0.136  0.136

An arguably more suitable metric is comparison in terms of conditional edge probability, i.e., P(X{ij} | W{ij}) for all i, j in the held out data. These cannot, however, be computed in a meaningful manner for models such as PMF and GPLVM, which assign a Gaussian likelihood to data. The next table hence reports only comparisons to the eigenmodel.

Negative log conditional edge probability³

Data set            High school    NIPS           Protein
Latent dimensions   1    2    3    1    2    3    1    2    3
Eigenmodel          220  210  200  88   81   75   96   92   86
RFM                 205  199  201  65   57   56   78   75   75

Remark 6.1 (Model complexity and lengthscales). Figure 2 provides a visualisation of Θ when modeling the protein interactome data using 1 latent dimension. The likelihood of the smooth peak is sensitive to the lengthscale of the Gaussian process representation of Θ. A Gaussian process prior introduces the assumption that Θ is continuous. Continuous functions are dense in the space of measurable functions, i.e., any measurable function can be arbitrarily well approximated by a continuous one. The assumption of continuity is therefore not restrictive; rather, the lengthscale of the Gaussian process determines the complexity of the model a priori. 
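The role of the lengthscale is visible directly in the RBF kernel (5.2): ℓ sets how quickly correlation between values of Θ decays with distance in the latent space. A small numeric illustration (with the scale factor s = 1 assumed for simplicity):

```python
import math

def rbf(x1, x2, lengthscale, scale=1.0):
    """RBF kernel (5.2): k(x1, x2) = s^2 exp(-|x1 - x2|^2 / (2 l^2))."""
    return scale ** 2 * math.exp(-(x1 - x2) ** 2 / (2 * lengthscale ** 2))

# Prior correlation between Theta at two latent points a distance 0.3 apart,
# under a short and a long lengthscale.
short = rbf(0.0, 0.3, lengthscale=0.1)   # near zero: Theta can vary rapidly
long_ = rbf(0.0, 0.3, lengthscale=1.0)   # near one: Theta is nearly constant
```

A short lengthscale thus permits intricate functions Θ (high model complexity), while a long one forces near-constant functions; sampling ℓ lets the posterior choose between these regimes.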
The nonparametric prior placed on Θ allows the posterior to approximate any function if supported by the data, but by sampling the lengthscale we allow the model to quickly select an appropriate level of complexity.

7 Discussion and conclusions

There has been a tremendous amount of research into modelling matrices, arrays, graphs and relational data, but nonparametric Bayesian modeling of such data is essentially uncharted territory. In most modelling circumstances, the assumption of exchangeability amongst data objects is natural and fundamental to the model. In this case, the representation results [2, 6, 7] precisely map out the scope of possible Bayesian models for exchangeable arrays: any such model can be interpreted as a prior on random measurable functions on a suitable space.

Nonparametric Bayesian statistics provides a number of possible priors on random functions, but the Gaussian process and its modifications are the only well-studied model for almost surely continuous functions. For this choice of prior, our work provides a general and simple modeling approach that can be motivated directly by the relevant representation results. The model yields both interpretable representations for networks, such as a visualisation of a protein interactome, and competitive predictive performance on benchmark data.

Acknowledgments

The authors would like to thank David Duvenaud, David Knowles and Konstantina Palla for helpful discussions. PO was supported by an EPSRC Mathematical Sciences Postdoctoral Research Fellowship (EP/I026827/1). ZG is supported by EPSRC grant EP/I036575/1. DMR is supported by a Newton International Fellowship and Emmanuel College.

³The precise calculation implemented is −log(P(X{ij} | W{ij})) × 1000 / (number of held out edges).

References

[1] Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008). Mixed Membership Stochastic Blockmodels. 
Journal of Machine Learning Research (JMLR), 9, 1981–2014.
[2] Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4), 581–598.
[3] Aldous, D. J. (2010). More uses of exchangeability: Representations of complex random structures. In Probability and Mathematical Genetics: Papers in Honour of Sir John Kingman.
[4] Hoff, P. D. (2007). Modeling homophily and stochastic equivalence in symmetric relational data. In Advances in Neural Information Processing Systems (NIPS), volume 20, pages 657–664.
[5] Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002). Latent Space Approaches to Social Network Analysis. Journal of the American Statistical Association, 97(460), 1090–1098.
[6] Hoover, D. N. (1979). Relations on probability spaces and arrays of random variables. Technical report, Institute for Advanced Study, Princeton.
[7] Kallenberg, O. (1992). Symmetries on random arrays and set-indexed processes. Journal of Theoretical Probability, 5(4), 727–765.
[8] Kallenberg, O. (1999). Multivariate Sampling and the Estimation Problem for Exchangeable Arrays. Journal of Theoretical Probability, 12(3), 859–883.
[9] Kallenberg, O. (2005). Probabilistic Symmetries and Invariance Principles. Springer.
[10] Kemp, C., Tenenbaum, J., Griffiths, T., Yamada, T., and Ueda, N. (2006). Learning systems of concepts with an infinite relational model. In Proceedings of the National Conference on Artificial Intelligence, volume 21.
[11] Lawrence, N. D. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research (JMLR), 6, 1783–1816.
[12] Lawrence, N. D. and Urtasun, R. (2009). Non-linear matrix factorization with Gaussian processes. In Proceedings of the International Conference on Machine Learning (ICML), pages 1–8. 
ACM Press.
[13] Lovász, L. and Szegedy, B. (2006). Limits of dense graph sequences. Journal of Combinatorial Theory Series B, 96, 933–957.
[14] Miller, K. T., Griffiths, T. L., and Jordan, M. I. (2009). Nonparametric latent feature models for link prediction. Advances in Neural Information Processing Systems (NIPS), pages 1276–1284.
[15] Murray, I., Adams, R. P., and Mackay, D. J. C. (2010). Elliptical slice sampling. Journal of Machine Learning Research (JMLR), 9, 541–548.
[16] Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3), 705–767. With discussions and a rejoinder by the author.
[17] Palla, K., Knowles, D. A., and Ghahramani, Z. (2012). An Infinite Latent Attribute Model for Network Data. In Proceedings of the International Conference on Machine Learning (ICML).
[18] Quiñonero Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 6, 1939–1959.
[19] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
[20] Roy, D. M. and Teh, Y. W. (2009). The Mondrian process. In Advances in Neural Information Processing Systems (NIPS).
[21] Salakhutdinov, R. (2008). Probabilistic Matrix Factorisation. In Advances in Neural Information Processing Systems (NIPS).
[22] Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society. Series B (Methodological), 47(1), 1–52.
[23] Smola, A. J. and Bartlett, P. (2001). Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems (NIPS). MIT Press.
[24] Titsias, M. K. and Lawrence, N. D. (2008). Efficient sampling for Gaussian process inference using control variables. 
In Advances in Neural Information Processing Systems (NIPS), pages 1681–1688.
[25] Wahba, G., Lin, X., Gao, F., Xiang, D., Klein, R., and Klein, B. (1999). The bias-variance tradeoff and the randomized GACV. In Advances in Neural Information Processing Systems (NIPS).
[26] Wang, Y. J. and Wong, G. Y. (1987). Stochastic Blockmodels for Directed Graphs. Journal of the American Statistical Association, 82(397), 8–19.
[27] Xu, Z., Yan, F., and Qi, Y. (2012). Infinite Tucker Decomposition: Nonparametric Bayesian Models for Multiway Data Analysis. In Proceedings of the International Conference on Machine Learning (ICML).
[28] Yan, F., Xu, Z., and Qi, Y. (2011). Sparse matrix-variate Gaussian process blockmodels for network modeling. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI).
", "award": [], "sourceid": 4572, "authors": [{"given_name": "James", "family_name": "Lloyd", "institution": null}, {"given_name": "Peter", "family_name": "Orbanz", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Daniel", "family_name": "Roy", "institution": null}]}