{"title": "Natural-Parameter Networks: A Class of Probabilistic Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 118, "page_last": 126, "abstract": "Neural networks (NN) have achieved state-of-the-art performance in various applications. Unfortunately in applications where training data is insufficient, they are often prone to overfitting. One effective way to alleviate this problem is to exploit the Bayesian approach by using Bayesian neural networks (BNN). Another shortcoming of NN is the lack of flexibility to customize different distributions for the weights and neurons according to the data, as is often done in probabilistic graphical models. To address these problems, we propose a class of probabilistic neural networks, dubbed natural-parameter networks (NPN), as a novel and lightweight Bayesian treatment of NN. NPN allows the usage of arbitrary exponential-family distributions to model the weights and neurons. Different from traditional NN and BNN, NPN takes distributions as input and goes through layers of transformation before producing distributions to match the target output distributions. As a Bayesian treatment, efficient backpropagation (BP) is performed to learn the natural parameters for the distributions over both the weights and neurons. The output distributions of each layer, as byproducts, may be used as second-order representations for the associated tasks such as link prediction. Experiments on real-world datasets show that NPN can achieve state-of-the-art performance.", "full_text": "Natural-Parameter Networks:\n\nA Class of Probabilistic Neural Networks\n\nHao Wang, Xingjian Shi, Dit-Yan Yeung\n\nHong Kong University of Science and Technology\n\n{hwangaz,xshiab,dyyeung}@cse.ust.hk\n\nAbstract\n\nNeural networks (NN) have achieved state-of-the-art performance in various appli-\ncations. Unfortunately in applications where training data is insuf\ufb01cient, they are\noften prone to over\ufb01tting. 
One effective way to alleviate this problem is to exploit\nthe Bayesian approach by using Bayesian neural networks (BNN). Another short-\ncoming of NN is the lack of \ufb02exibility to customize different distributions for the\nweights and neurons according to the data, as is often done in probabilistic graphi-\ncal models. To address these problems, we propose a class of probabilistic neural\nnetworks, dubbed natural-parameter networks (NPN), as a novel and lightweight\nBayesian treatment of NN. NPN allows the usage of arbitrary exponential-family\ndistributions to model the weights and neurons. Different from traditional NN\nand BNN, NPN takes distributions as input and goes through layers of transfor-\nmation before producing distributions to match the target output distributions. As\na Bayesian treatment, ef\ufb01cient backpropagation (BP) is performed to learn the\nnatural parameters for the distributions over both the weights and neurons. The\noutput distributions of each layer, as byproducts, may be used as second-order\nrepresentations for the associated tasks such as link prediction. Experiments on\nreal-world datasets show that NPN can achieve state-of-the-art performance.\n\n1\n\nIntroduction\n\nRecently neural networks (NN) have achieved state-of-the-art performance in various applications\nranging from computer vision [12] to natural language processing [20]. However, NN trained by\nstochastic gradient descent (SGD) or its variants is known to suffer from over\ufb01tting especially\nwhen training data is insuf\ufb01cient. 
Besides over\ufb01tting, another problem of NN comes from the\nunderestimated uncertainty, which could lead to poor performance in applications like active learning.\nBayesian neural networks (BNN) offer the promise of tackling these problems in a principled way.\nEarly BNN works include methods based on Laplace approximation [16], variational inference (VI)\n[11], and Monte Carlo sampling [18], but they have not been widely adopted due to their lack of\nscalability. Some recent advances in this direction seem to shed light on the practical adoption of\nBNN. [8] proposed a method based on VI in which a Monte Carlo estimate of a lower bound on the\nmarginal likelihood is used to infer the weights. Recently, [10] used an online version of expectation\npropagation (EP), called \u2018probabilistic back propagation\u2019 (PBP), for the Bayesian learning of NN,\nand [4] proposed \u2018Bayes by Backprop\u2019 (BBB), which can be viewed as an extension of [8] based on\nthe \u2018reparameterization trick\u2019 [13]. 
More recently, an interesting Bayesian treatment called 'Bayesian dark knowledge' (BDK) was designed to approximate a teacher network with a simpler student network based on stochastic gradient Langevin dynamics (SGLD) [1].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Although these recent methods are more practical than earlier ones, several outstanding problems remain to be addressed: (1) most of these methods require sampling either at training time [8, 4, 1] or at test time [4], incurring much higher cost than a 'vanilla' NN; (2) as mentioned in [1], methods based on online EP or VI do not involve sampling, but they need to compute the predictive density by integrating out the parameters, which is computationally inefficient; (3) these methods assume Gaussian distributions for the weights and neurons, allowing no flexibility to customize different distributions according to the data as is done in probabilistic graphical models (PGM).
To address these problems, we propose natural-parameter networks (NPN) as a class of probabilistic neural networks where the input, target output, weights, and neurons can all be modeled by arbitrary exponential-family distributions (e.g., Poisson distributions for word counts) instead of being limited to Gaussian distributions. Input distributions go through layers of linear and nonlinear transformation deterministically before producing distributions to match the target output distributions (previous work [21] shows that providing distributions as input by corrupting the data with noise plays the role of regularization). As byproducts, output distributions of intermediate layers may be used as second-order representations for the associated tasks.
Thanks to the properties of the exponential\nfamily [3, 19], distributions in NPN are de\ufb01ned by the corresponding natural parameters which can\nbe learned ef\ufb01ciently by backpropagation. Unlike [4, 1], NPN explicitly propagates the estimates of\nuncertainty back and forth in deep networks. This way the uncertainty estimates for each layer of\nneurons are readily available for the associated tasks. Our experiments show that such information is\nhelpful when neurons of intermediate layers are used as representations like in autoencoders (AE). In\nsummary, our main contributions are:\n\n\u2022 We propose NPN as a class of probabilistic neural networks. Our model combines the merits\nof NN and PGM in terms of computational ef\ufb01ciency and \ufb02exibility to customize the types\nof distributions for different types of data.\n\u2022 Leveraging the properties of the exponential family, some sampling-free backpropagation-\ncompatible algorithms are designed to ef\ufb01ciently learn the distributions over weights by\nlearning the natural parameters.\n\u2022 Unlike most probabilistic NN models, NPN obtains the uncertainty of intermediate-layer\nneurons as byproducts, which provide valuable information to the learned representations.\nExperiments on real-world datasets show that NPN can achieve state-of-the-art performance\non classi\ufb01cation, regression, and unsupervised representation learning tasks.\n\n2 Natural-Parameter Networks\n\nThe exponential family refers to an important class of distributions with useful algebraic properties.\nDistributions in the exponential family have the form p(x|\u03b7) = h(x)g(\u03b7) exp{\u03b7T u(x)}, where x is\nthe random variable, \u03b7 denotes the natural parameters, u(x) is a vector of suf\ufb01cient statistics, and\ng(\u03b7) is the normalizer. 
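As a quick sanity check of this form (our sketch, not part of the paper), the univariate Gaussian can be written as $h(x) g(\eta) \exp\{\eta^T u(x)\}$ with sufficient statistics $u(x) = (x, x^2)$ and natural parameters $\eta = (\mu/\sigma^2, -1/(2\sigma^2))$; the function names below are ours:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Textbook Gaussian density N(x | mu, sigma2)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_pdf_expfam(x, mu, sigma2):
    """The same density in exponential-family form h(x) g(eta) exp{eta^T u(x)},
    with u(x) = (x, x^2) and eta = (mu / sigma2, -1 / (2 sigma2))."""
    eta1 = mu / sigma2
    eta2 = -1.0 / (2.0 * sigma2)
    h = 1.0 / math.sqrt(2.0 * math.pi)                                # base measure h(x)
    g = math.sqrt(-2.0 * eta2) * math.exp(eta1 ** 2 / (4.0 * eta2))   # normalizer g(eta)
    return h * g * math.exp(eta1 * x + eta2 * x ** 2)
```

Here $h(x) = (2\pi)^{-1/2}$ and $g(\eta) = \sqrt{-2\eta_2}\,\exp(\eta_1^2 / (4\eta_2))$ absorb everything that does not depend on the sufficient statistics, so the two parameterizations agree numerically.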
For a given type of distributions, different choices of $\eta$ lead to different shapes. For example, a univariate Gaussian distribution with $\eta = (c, d)^T$ corresponds to $\mathcal{N}(-\frac{c}{2d}, -\frac{1}{2d})$. Motivated by this observation, in NPN, only the natural parameters need to be learned to model the distributions over the weights and neurons. Consider an NPN which takes a vector random distribution (e.g., a multivariate Gaussian distribution) as input, multiplies it by a matrix random distribution, goes through nonlinear transformation, and outputs another distribution. Since all three distributions in the process can be specified by their natural parameters (given the types of distributions), learning and prediction of the network can actually operate in the space of natural parameters. For example, if we use element-wise (factorized) gamma distributions for both the weights and neurons, the NPN counterpart of a vanilla network only needs twice the number of free parameters (weights) and neurons since there are two natural parameters for each univariate gamma distribution.

2.1 Notation and Conventions

We use boldface uppercase letters like $W$ to denote matrices and boldface lowercase letters like $b$ for vectors. Similarly, a boldface number (e.g., $1$ or $0$) represents a row vector or a matrix with identical entries. In NPN, $o^{(l)}$ is used to denote the values of neurons in layer $l$ before nonlinear transformation and $a^{(l)}$ is for the values after nonlinear transformation. As mentioned above, NPN tries to learn distributions over variables rather than variables themselves. Hence we use letters without subscripts $c$, $d$, $m$, and $s$ (e.g., $o^{(l)}$ and $a^{(l)}$) to denote 'random variables' with corresponding distributions. Subscripts $c$ and $d$ are used to denote natural-parameter pairs, such as $W_c$ and $W_d$. Similarly, subscripts $m$ and $s$ are for mean-variance pairs.
Note that for clarity, many operations used below are implicitly element-wise, for example, the square $z^2$, division $\frac{z}{b}$, partial derivative $\frac{\partial z}{\partial b}$, the gamma function $\Gamma(z)$, logarithm $\log z$, factorial $z!$, $1 + z$, and $\frac{1}{z}$. For the data $D = \{(x_i, y_i)\}_{i=1}^N$, we set $a_m^{(0)} = x_i$, $a_s^{(0)} = 0$ (input distributions with $a_s^{(0)} \neq 0$ resemble AE's denoising effect) as input of the network and $y_i$ denotes the output targets (e.g., labels and word counts). In the following text we drop the subscript $i$ (and sometimes the superscript $(l)$) for clarity. The bracket $(\cdot, \cdot)$ denotes concatenation or pairs of vectors.

2.2 Linear Transformation in NPN

Here we first introduce the linear form of a general NPN. For simplicity, we assume distributions with two natural parameters (e.g., gamma distributions, beta distributions, and Gaussian distributions), $\eta = (c, d)^T$, in this section. Specifically, we have factorized distributions on the weight matrices, $p(W^{(l)}|W_c^{(l)}, W_d^{(l)}) = \prod_{i,j} p(W_{ij}^{(l)}|W_{c,ij}^{(l)}, W_{d,ij}^{(l)})$, where the pair $(W_{c,ij}^{(l)}, W_{d,ij}^{(l)})$ is the corresponding natural parameters. For $b^{(l)}$, $o^{(l)}$, and $a^{(l)}$ we assume similar factorized distributions. In a traditional NN, the linear transformation follows $o^{(l)} = a^{(l-1)} W^{(l)} + b^{(l)}$ where $a^{(l-1)}$ is the output from the previous layer. In NN $a^{(l-1)}$, $W^{(l)}$, and $b^{(l)}$ are deterministic variables while in NPN they are exponential-family distributions, meaning that the result $o^{(l)}$ is also a distribution. For convenience of subsequent computation it is desirable to approximate $o^{(l)}$ using another exponential-family distribution. We can do this by matching the mean and variance.
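Matching means and variances requires converting back and forth between (proxy) natural parameters and moments. A minimal sketch of such a mapping $f$ and its inverse (our code, not the paper's; it uses the shape-rate gamma parameterization that Section 3 later adopts as proxy natural parameters):

```python
def f_gamma(c, d):
    """f maps a gamma distribution's (proxy) natural parameters, here
    shape c and rate d, to its mean and variance: (c / d, c / d^2)."""
    return c / d, c / d ** 2

def f_gamma_inv(m, s):
    """Inverse mapping f^{-1}: recover (shape, rate) from mean m and
    variance s, since c = m^2 / s and d = m / s."""
    return m ** 2 / s, m / s
```

The two functions are exact inverses of each other, which is what lets NPN move freely between parameter space and moment space during the linear step.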
Specifically, after computing $(W_m^{(l)}, W_s^{(l)}) = f(W_c^{(l)}, W_d^{(l)})$, $(b_m^{(l)}, b_s^{(l)}) = f(b_c^{(l)}, b_d^{(l)})$, and $(a_m^{(l-1)}, a_s^{(l-1)}) = f(a_c^{(l-1)}, a_d^{(l-1)})$, we can get $o_c^{(l)}$ and $o_d^{(l)}$ through the mean $o_m^{(l)}$ and variance $o_s^{(l)}$ of $o^{(l)}$ as follows:

$o_m^{(l)} = a_m^{(l-1)} W_m^{(l)} + b_m^{(l)}$,   (1)
$o_s^{(l)} = a_s^{(l-1)} (W_s^{(l)} + W_m^{(l)} \circ W_m^{(l)}) + (a_m^{(l-1)} \circ a_m^{(l-1)}) W_s^{(l)} + b_s^{(l)}$,   (2)
$(o_c^{(l)}, o_d^{(l)}) = f^{-1}(o_m^{(l)}, o_s^{(l)})$,   (3)

where $\circ$ denotes the element-wise product and the bijective function $f(\cdot, \cdot)$ maps the natural parameters of a distribution into its mean and variance (e.g., $f(c, d) = (\frac{c+1}{-d}, \frac{c+1}{d^2})$ in gamma distributions). Similarly we use $f^{-1}(\cdot, \cdot)$ to denote the inverse transformation. $W_m^{(l)}$, $W_s^{(l)}$, $b_m^{(l)}$, and $b_s^{(l)}$ are the mean and variance of $W^{(l)}$ and $b^{(l)}$ obtained from the natural parameters. The computed $o_m^{(l)}$ and $o_s^{(l)}$ can then be used to recover $o_c^{(l)}$ and $o_d^{(l)}$, which will subsequently facilitate the feedforward computation of the nonlinear transformation described in Section 2.3.

2.3 Nonlinear Transformation in NPN

After we obtain the linearly transformed distribution over $o^{(l)}$ defined by natural parameters $o_c^{(l)}$ and $o_d^{(l)}$, an element-wise nonlinear transformation $v(\cdot)$ (with a well defined inverse function $v^{-1}(\cdot)$) will be imposed.
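Before the nonlinear step, the moment matching of Equations (1)-(2) can be sketched in a few lines (a pure-Python illustration under the element-wise independence assumptions above; function and variable names are ours):

```python
def npn_linear(a_m, a_s, W_m, W_s, b_m, b_s):
    """Moment-matched linear layer o = a W + b (Equations (1)-(2)).

    a_m, a_s: element-wise mean and variance of the input vector a;
    W_m, W_s: element-wise mean and variance of the weight matrix W;
    b_m, b_s: element-wise mean and variance of the bias b.
    All entries are assumed independent, as in the factorized model.
    """
    n_in, n_out = len(W_m), len(W_m[0])
    o_m = [b_m[j] + sum(a_m[i] * W_m[i][j] for i in range(n_in))
           for j in range(n_out)]
    # Var(a_i * W_ij) = a_s (W_s + W_m^2) + a_m^2 W_s for independent a_i, W_ij
    o_s = [b_s[j] + sum(a_s[i] * (W_s[i][j] + W_m[i][j] ** 2)
                        + a_m[i] ** 2 * W_s[i][j] for i in range(n_in))
           for j in range(n_out)]
    return o_m, o_s
```

With $a_s = 0$ and $W_s = 0$ the layer reduces to an ordinary deterministic linear transformation, which is a useful sanity check.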
The resulting activation distribution is $p_a(a^{(l)}) = p_o(v^{-1}(a^{(l)})) \, |{v^{-1}}'(a^{(l)})|$, where $p_o$ is the factorized distribution over $o^{(l)}$ defined by $(o_c^{(l)}, o_d^{(l)})$. Though $p_a(a^{(l)})$ may not be an exponential-family distribution, we can approximate it with one, $p(a^{(l)}|a_c^{(l)}, a_d^{(l)})$, by matching the first two moments. Once the mean $a_m^{(l)}$ and variance $a_s^{(l)}$ of $p_a(a^{(l)})$ are obtained, we can compute corresponding natural parameters with $f^{-1}(\cdot, \cdot)$ (approximation accuracy is sufficient according to preliminary experiments). The feedforward computation is:

$a_m = \int p_o(o|o_c, o_d) \, v(o) \, do$, $\quad a_s = \int p_o(o|o_c, o_d) \, v(o)^2 \, do - a_m^2$, $\quad (a_c, a_d) = f^{-1}(a_m, a_s)$.   (4)

Here the key computational challenge is computing the integrals in Equation (4). Closed-form solutions are needed for their efficient computation. If $p_o(o|o_c, o_d)$ is a Gaussian distribution, closed-form solutions exist for common activation functions like $\tanh(x)$ and $\max(0, x)$ (details are in Section 3.2). Unfortunately this is not the case for other distributions. Leveraging the convenient form of the exponential family, we find that it is possible to design activation functions so that the integrals for non-Gaussian distributions can also be expressed in closed form.
Theorem 1. Assume an exponential-family distribution $p_o(x|\eta) = h(x) g(\eta) \exp\{\eta^T u(x)\}$, where the vector $u(x) = (u_1(x), u_2(x), \dots, u_M(x))^T$ ($M$ is the number of natural parameters).
If activation function $v(x) = r - q \exp(-\tau u_i(x))$ is used, the first two moments of $v(x)$, $\int p_o(x|\eta) v(x) \, dx$ and $\int p_o(x|\eta) v(x)^2 \, dx$, can be expressed in closed form. Here $i \in \{1, 2, \dots, M\}$ (different $u_i(x)$ corresponds to a different set of activation functions) and $r$, $q$, and $\tau$ are constants.

Table 1: Activation Functions for Exponential-Family Distributions
Distribution | Probability Density Function | Activation Function | Support
Beta Distribution | $p(x) = \frac{\Gamma(c+d)}{\Gamma(c)\Gamma(d)} x^{c-1} (1-x)^{d-1}$ | $q x^{\tau}$, $\tau \in (0, 1)$ | $[0, 1]$
Rayleigh Distribution | $p(x) = \frac{x}{\sigma^2} \exp\{-\frac{x^2}{2\sigma^2}\}$ | $r - q \exp\{-\tau x^2\}$ | $(0, +\infty)$
Gamma Distribution | $p(x) = \frac{1}{\Gamma(c)} d^c x^{c-1} \exp\{-dx\}$ | $r - q \exp\{-\tau x\}$ | $(0, +\infty)$
Poisson Distribution | $p(x) = \frac{c^x \exp\{-c\}}{x!}$ | $r - q \exp\{-\tau x\}$ | Nonnegative integers
Gaussian Distribution | $p(x) = (2\pi\sigma^2)^{-\frac{1}{2}} \exp\{-\frac{1}{2\sigma^2}(x - \mu)^2\}$ | ReLU, tanh, and sigmoid | $(-\infty, +\infty)$

Proof. We first let $\eta = (\eta_1, \eta_2, \dots, \eta_M)$, $\tilde{\eta} = (\eta_1, \eta_2, \dots, \eta_i - \tau, \dots, \eta_M)$, and $\hat{\eta} = (\eta_1, \eta_2, \dots, \eta_i - 2\tau, \dots, \eta_M)$. The first moment of $v(x)$ is

$E(v(x)) = r - q \int h(x) g(\eta) \exp\{\eta^T u(x) - \tau u_i(x)\} \, dx = r - q \frac{g(\eta)}{g(\tilde{\eta})} \int h(x) g(\tilde{\eta}) \exp\{\tilde{\eta}^T u(x)\} \, dx = r - q \frac{g(\eta)}{g(\tilde{\eta})}$.

Similarly the second moment can be computed as $E(v(x)^2) = r^2 + q^2 \frac{g(\eta)}{g(\hat{\eta})} - 2rq \frac{g(\eta)}{g(\tilde{\eta})}$.

A more detailed proof is provided in the supplementary material.
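As an illustration of Theorem 1 (our sketch, not the authors' code), the closed-form moments of $v(x) = r - q \exp(-\tau x)$ under a gamma distribution, where the $g(\eta)$ ratios reduce to moment-generating-function values, can be checked against brute-force numerical integration:

```python
import math

def gamma_moments_closed_form(oc, od, r, q, tau):
    """Closed-form E[v(x)] and E[v(x)^2] for v(x) = r - q exp(-tau x)
    with x ~ Gamma(shape oc, rate od). Theorem 1's g(eta) ratios reduce
    to E[exp(-tau x)] = (od / (od + tau)) ** oc here."""
    A = (od / (od + tau)) ** oc          # g(eta) / g(eta tilde)
    B = (od / (od + 2.0 * tau)) ** oc    # g(eta) / g(eta hat)
    return r - q * A, r * r - 2.0 * r * q * A + q * q * B

def gamma_moments_numeric(oc, od, r, q, tau, hi=60.0, n=400000):
    """Midpoint-rule integration of the same two moments, for comparison."""
    dx = hi / n
    m1 = m2 = 0.0
    for k in range(n):
        x = (k + 0.5) * dx
        p = od ** oc * x ** (oc - 1.0) * math.exp(-od * x) / math.gamma(oc)
        v = r - q * math.exp(-tau * x)
        m1 += p * v * dx
        m2 += p * v * v * dx
    return m1, m2
```

The constants and the test values below are ours; the point is only that the two computations agree, as the theorem predicts.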
With Theorem 1, what remains is to find the constants that make $v(x)$ strictly increasing and bounded (Table 1 shows some exponential-family distributions and their possible activation functions). For example in Equation (4), if $v(x) = r - q \exp(-\tau x)$, $a_m = r - q (\frac{o_d}{o_d + \tau})^{o_c}$ for the gamma distribution.
In the backpropagation, for distributions with two natural parameters the gradient consists of two terms. For example, $\frac{\partial E}{\partial o_c} = \frac{\partial E}{\partial a_m} \circ \frac{\partial a_m}{\partial o_c} + \frac{\partial E}{\partial a_s} \circ \frac{\partial a_s}{\partial o_c}$, where $E$ is the error term of the network.

Algorithm 1 Deep Nonlinear NPN
1: Input: Data $D = \{(x_i, y_i)\}_{i=1}^N$, number of iterations $T$, learning rate $\rho_t$, number of layers $L$.
2: for $t = 1 : T$ do
3:   for $l = 1 : L$ do
4:     Apply Equation (1)-(4) to compute the linear and nonlinear transformation in layer $l$.
5:   end for
6:   Compute the error $E$ from $(o_c^{(L)}, o_d^{(L)})$ or $(a_c^{(L)}, a_d^{(L)})$.
7:   for $l = L : 1$ do
8:     Compute $\frac{\partial E}{\partial W_m^{(l)}}$, $\frac{\partial E}{\partial W_s^{(l)}}$, $\frac{\partial E}{\partial b_m^{(l)}}$, and $\frac{\partial E}{\partial b_s^{(l)}}$. Compute $\frac{\partial E}{\partial W_c^{(l)}}$, $\frac{\partial E}{\partial W_d^{(l)}}$, $\frac{\partial E}{\partial b_c^{(l)}}$, and $\frac{\partial E}{\partial b_d^{(l)}}$.
9:   end for
10:  Update $W_c^{(l)}$, $W_d^{(l)}$, $b_c^{(l)}$, and $b_d^{(l)}$ in all layers.
11: end for

2.4 Deep Nonlinear NPN

Naturally layers of nonlinear NPN can be stacked to form a deep NPN^1, as shown in Algorithm 1^2. A deep NPN is in some sense similar to a PGM with a chain structure. Unlike PGM in general, however, NPN does not need costly inference algorithms like variational inference or Markov chain Monte Carlo. For some chain-structured PGM (e.g., hidden Markov models), efficient inference algorithms also exist due to their special structure.
Similarly, the Markov property enables NPN to be efficiently trained in an end-to-end backpropagation learning fashion in the space of natural parameters.
PGM is known to be more flexible than NN in the sense that it can choose different distributions to depict different relationships among variables. A major drawback of PGM is its scalability especially when the PGM is deep. Different from PGM, NN stacks relatively simple computational layers and learns the parameters using backpropagation, which is computationally more efficient than most algorithms for PGM. NPN has the potential to get the best of both worlds. In terms of flexibility, different types of exponential-family distributions can be chosen for the weights and neurons. Using gamma distributions for both the weights and neurons in NPN leads to a deep and nonlinear version of nonnegative matrix factorization [14] while an NPN with the Bernoulli distribution and sigmoid activation resembles a Bayesian treatment of sigmoid belief networks [17]. If Poisson distributions are chosen for the neurons, NPN becomes a neural analogue of deep Poisson factor analysis [26, 9].
Note that similar to the weight decay in NN, we may add the KL divergence between the prior distributions and the learned distributions on the weights to the error $E$ for regularization (we use isotropic Gaussian priors in the experiments). In NPN, the chosen prior distributions correspond to priors in Bayesian models and the learned distributions correspond to the approximation of posterior distributions on weights.

^1 Although the approximation accuracy may decrease as NPN gets deeper during feedforward computation, it can be automatically adjusted according to data during backpropagation.
^2 Note that since the first part of Equation (1) and the last part of Equation (4) are canceled out, we can directly use $(a_m^{(l)}, a_s^{(l)})$ without computing $(a_c^{(l)}, a_d^{(l)})$ here.
Note that the generative story assumed here is that weights are sampled from the prior, and then output is generated (given all data) from these weights.

3 Variants of NPN

In this section, we introduce three NPN variants with different properties to demonstrate the flexibility and effectiveness of NPN. Note that in practice we use a transformed version of the natural parameters, referred to as proxy natural parameters here, instead of the original ones for computational efficiency. For example, in gamma distributions $p(x|c, d) = \Gamma(c)^{-1} d^c x^{c-1} \exp(-dx)$, we use proxy natural parameters $(c, d)$ during computation rather than the natural parameters $(c - 1, -d)$.

3.1 Gamma NPN

The gamma distribution with support over positive values is an important member of the exponential family. The corresponding probability density function is $p(x|c, d) = \Gamma(c)^{-1} d^c x^{c-1} \exp(-dx)$ with $(c - 1, -d)$ as its natural parameters (we use $(c, d)$ as proxy natural parameters). If we assume gamma distributions for $W^{(l)}$, $b^{(l)}$, $o^{(l)}$, and $a^{(l)}$, an AE formed by NPN becomes a deep and nonlinear version of nonnegative matrix factorization [14]. To see this, note that this AE with activation $v(x) = x$ and zero biases $b^{(l)}$ is equivalent to finding a factorization of matrix $X$ such that $X = H \prod_{l=L/2}^{L} W^{(l)}$, where $H$ denotes the middle-layer neurons and $W^{(l)}$ has nonnegative entries from gamma distributions. In this gamma NPN, parameters $W_c^{(l)}$, $W_d^{(l)}$, $b_c^{(l)}$, and $b_d^{(l)}$ can be learned following Algorithm 1.
We detail the algorithm as follows:
Linear Transformation: Since gamma distributions are assumed here, we can use the function $f(c, d) = (\frac{c}{d}, \frac{c}{d^2})$ to compute $(W_m^{(l)}, W_s^{(l)}) = f(W_c^{(l)}, W_d^{(l)})$, $(b_m^{(l)}, b_s^{(l)}) = f(b_c^{(l)}, b_d^{(l)})$, and $(o_c^{(l)}, o_d^{(l)}) = f^{-1}(o_m^{(l)}, o_s^{(l)})$ during the probabilistic linear transformation in Equation (1)-(3).
Nonlinear Transformation: With the proxy natural parameters for the gamma distributions over $o^{(l)}$, the mean $a_m^{(l)}$ and variance $a_s^{(l)}$ for the nonlinearly transformed distribution over $a^{(l)}$ would be obtained with Equation (4). Following Theorem 1, closed-form solutions are possible with $v(x) = r(1 - \exp(-\tau x))$ ($r = q$ and $u_i(x) = x$) where $r$ and $\tau$ are constants. Using this new activation function, we have (see Section 2.1 and 6.1 of the supplementary material for details on the function and derivation):

$a_m = \int p_o(o|o_c, o_d) v(o) \, do = r(1 - \frac{o_d^{o_c}}{\Gamma(o_c)} \circ \Gamma(o_c) \circ (o_d + \tau)^{-o_c}) = r(1 - (\frac{o_d}{o_d + \tau})^{o_c})$,
$a_s = r^2 ((\frac{o_d}{o_d + 2\tau})^{o_c} - (\frac{o_d}{o_d + \tau})^{2 o_c})$.

Error: With $o_c^{(L)}$ and $o_d^{(L)}$, we can compute the regression error $E$ as the negative log-likelihood:

$E = (\log \Gamma(o_c^{(L)}) - o_c^{(L)} \circ \log o_d^{(L)} - (o_c^{(L)} - 1) \circ \log y + o_d^{(L)} \circ y) 1^T$,

where $y$ is the observed output corresponding to $x$. For classification, cross-entropy loss can be used as $E$. Following the computation flow above, BP can be used to learn $W_c^{(l)}$, $W_d^{(l)}$, $b_c^{(l)}$, and $b_d^{(l)}$.

Figure 1: Predictive distributions for PBP, BDK, dropout NN, and NPN. The shaded regions correspond to $\pm 3$ standard deviations. The black curve is the data-generating function and blue curves show the mean of the predictive distributions.
Red stars are the training data.

3.2 Gaussian NPN

Different from the gamma distribution which has support over positive values only, the Gaussian distribution, also an exponential-family distribution, can describe real-valued random variables. This makes it a natural choice for NPN. We refer to this NPN variant with Gaussian distributions over both the weights and neurons as Gaussian NPN. Details of Algorithm 1 for Gaussian NPN are as follows:
Linear Transformation: Besides support over real values, another property of Gaussian distributions is that the mean and variance can be used as proxy natural parameters, leading to an identity mapping function $f(c, d) = (c, d)$ which cuts the computation cost. We can use this function to compute $(W_m^{(l)}, W_s^{(l)}) = f(W_c^{(l)}, W_d^{(l)})$, $(b_m^{(l)}, b_s^{(l)}) = f(b_c^{(l)}, b_d^{(l)})$, and $(o_c^{(l)}, o_d^{(l)}) = f^{-1}(o_m^{(l)}, o_s^{(l)})$ during the probabilistic linear transformation in Equation (1)-(3).
Nonlinear Transformation: If the sigmoid activation $v(x) = \sigma(x) = \frac{1}{1 + \exp(-x)}$ is used, $a_m$ in Equation (4) would be (the convolution of a Gaussian with a sigmoid is approximated by another sigmoid):

$a_m = \int \mathcal{N}(o|o_c, \mathrm{diag}(o_d)) \circ \sigma(o) \, do \approx \sigma(\frac{o_c}{(1 + \zeta^2 o_d)^{1/2}})$,   (5)
$a_s = \int \mathcal{N}(o|o_c, \mathrm{diag}(o_d)) \circ \sigma(o)^2 \, do - a_m^2 \approx \sigma(\frac{\alpha (o_c + \beta)}{(1 + \zeta^2 \alpha^2 o_d)^{1/2}}) - a_m^2$,   (6)

where $\alpha = 4 - 2\sqrt{2}$, $\beta = -\log(\sqrt{2} + 1)$, and $\zeta^2 = \pi/8$. A similar approximation can be applied for the activation $v(x) = \tanh(x)$ since $\tanh(x) = 2\sigma(2x) - 1$.
If the ReLU activation $v(x) = \max(0, x)$ is used, we can use the techniques in [6] to obtain the first two moments of $\max(z_1, z_2)$ where $z_1$ and $z_2$ are Gaussian random variables.
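The approximation in Equation (5) can be probed numerically; the following sketch (ours, not the paper's code) compares the sigmoid-of-rescaled-mean formula against direct midpoint-rule integration of the Gaussian-sigmoid product:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gauss_sigmoid_mean_approx(oc, od):
    """Equation (5): E[sigmoid(o)] for o ~ N(oc, od), approximating the
    Gaussian-sigmoid convolution by another sigmoid with zeta^2 = pi/8."""
    zeta2 = math.pi / 8.0
    return sigmoid(oc / math.sqrt(1.0 + zeta2 * od))

def gauss_sigmoid_mean_numeric(oc, od, n=200000, width=10.0):
    """Midpoint-rule integration of the same expectation, for comparison."""
    sd = math.sqrt(od)
    lo = oc - width * sd
    dx = 2.0 * width * sd / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        p = math.exp(-(x - oc) ** 2 / (2.0 * od)) / math.sqrt(2.0 * math.pi * od)
        total += p * sigmoid(x) * dx
    return total
```

For moderate means and variances the two values typically agree to a couple of decimal places, which is what makes the sampling-free feedforward pass of Gaussian NPN practical.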
Full derivation for $v(x) = \sigma(x)$, $v(x) = \tanh(x)$, and $v(x) = \max(0, x)$ is left to the supplementary material.
Error: With $o_c^{(L)}$ and $o_d^{(L)}$ in the last layer, we can then compute the error $E$ as the KL divergence $KL(\mathcal{N}(o_c^{(L)}, \mathrm{diag}(o_d^{(L)})) \| \mathcal{N}(y, \mathrm{diag}(\epsilon)))$, where $\epsilon$ is a vector with all entries equal to a small value $\epsilon$. Hence the error $E = \frac{1}{2}((\frac{\epsilon}{o_d^{(L)}}) 1^T + ((\frac{1}{o_d^{(L)}}) \circ (o_c^{(L)} - y))(o_c^{(L)} - y)^T - K + (\log o_d^{(L)}) 1^T - K \log \epsilon)$. For classification tasks, cross-entropy loss is used. Following the computation flow above, BP can be used to learn $W_c^{(l)}$, $W_d^{(l)}$, $b_c^{(l)}$, and $b_d^{(l)}$.

3.3 Poisson NPN

The Poisson distribution, as another member of the exponential family, is often used to model counts (e.g., counts of words, topics, or super topics in documents). Hence for text modeling, it is natural to assume Poisson distributions for neurons in NPN. Interestingly, this design of Poisson NPN can be seen as a neural analogue of some Poisson factor analysis models [26].
Besides closed-form nonlinear transformation, another challenge of Poisson NPN is to map the pair $(o_m^{(l)}, o_s^{(l)})$ to the single parameter $o_c^{(l)}$ of Poisson distributions. According to the central limit theorem, we have $o_c^{(l)} = \frac{1}{4}(2 o_m^{(l)} - 1 + \sqrt{(2 o_m^{(l)} - 1)^2 + 8 o_s^{(l)}})$ (see Section 3 and 6.3 of the supplementary material for proofs, justifications, and detailed derivation of Poisson NPN).

4 Experiments

In this section we evaluate variants of NPN and other state-of-the-art methods on four real-world datasets. We use Matlab (with GPU) to implement NPN, AE variants, and the 'vanilla' NN trained with dropout SGD (dropout NN).
For other baselines, we use the Theano library [2] and MXNet [5].

Table 2: Test Error Rates on MNIST
Method | BDK | BBB | Dropout1 | Dropout2 | gamma NPN | Gaussian NPN
Error | 1.38% | 1.34% | 1.33% | 1.40% | 1.27% | 1.25%

Table 3: Test Error Rates for Different Size of Training Data
Size | 100 | 500 | 2,000 | 10,000
NPN | 29.97% | 13.79% | 7.89% | 3.28%
Dropout | 32.58% | 15.39% | 8.78% | 3.53%
BDK | 30.08% | 14.34% | 8.31% | 3.55%

4.1 Toy Regression Task

To gain some insights into NPN, we start with a toy 1d regression task so that the predicted mean and variance can be visualized. Following [1], we generate 20 points in one dimension from a uniform distribution in the interval $[-4, 4]$. The target outputs are sampled from the function $y = x^3 + \epsilon_n$, where $\epsilon_n \sim \mathcal{N}(0, 9)$. We fit the data with the Gaussian NPN, BDK, and PBP (see the supplementary material for detailed hyperparameters). Figure 1 shows the predicted mean and variance of NPN, BDK, and PBP along with the mean provided by the dropout NN (for larger versions of figures please refer to the end of the supplementary materials). As we can see, the variance of PBP, BDK, and NPN diverges as $x$ is farther away from the training data. Both NPN's and BDK's predictive distributions are accurate enough to keep most of the $y = x^3$ curve inside the shaded regions with relatively low variance. An interesting observation is that the training data points become more scattered when $x > 0$. Ideally, the variance should start diverging from $x = 0$, which is what happens in NPN. However, PBP and BDK are not sensitive enough to capture this dispersion change. In another dataset, Boston Housing, the root mean square error for PBP, BDK, and NPN is 3.01, 2.82, and 2.57.

4.2 MNIST Classification

The MNIST digit dataset consists of 60,000 training images and 10,000 test images.
All images are labeled as one of the 10 digits. We train the models with 50,000 images and use 10,000 images for validation. Networks with a structure of 784-800-800-10 are used for all methods, since 800 works best for the dropout NN (denoted as Dropout1 in Table 2) and BDK (BDK with a structure of 784-400-400-10 achieves an error rate of 1.41%). We also try the dropout NN with twice the number of hidden neurons (Dropout2 in Table 2) for fair comparison. For BBB, we directly quote their results from [4]. We implement BDK and NPN using the same hyperparameters as in [1] whenever possible. Gaussian priors are used for NPN (see the supplementary material for detailed hyperparameters).
As shown in Table 2, BDK and BBB achieve comparable performance with dropout NN (similar to [1], PBP is not included in the comparison since it supports regression only), and gamma NPN slightly outperforms dropout NN. Gaussian NPN is able to achieve a lower error rate of 1.25%. Note that BBB with Gaussian priors can only achieve an error rate of 1.82%; 1.34% is the result of using Gaussian mixture priors. For reference, the error rate for dropout NN with 1600 neurons in each hidden layer is 1.40%. The time cost per epoch is 18.3s, 16.2s, and 6.4s for NPN, BDK, and NN respectively. Note that BDK is in C++ and NPN is in Matlab.
To evaluate NPN's ability as a Bayesian treatment to avoid overfitting, we vary the size of the training set (from 100 to 10,000 data points) and compare the test error rates. As shown in Table 3, the margin between the Gaussian NPN and dropout NN increases as the training set shrinks. Besides, to verify the effectiveness of the estimated uncertainty, we split the test set into 9 subsets according to NPN's estimated variance (uncertainty) $a_s^{(L)} 1^T$ for each sample and show the accuracy for each subset in Figure 2.
We can find that the more uncertain NPN is, the lower the accuracy, indicating that the estimated uncertainty is well calibrated.

Figure 2: Classification accuracy for different variance (uncertainty). Note that '1' in the x-axis means $a_s^{(L)} 1^T \in [0, 0.04)$, '2' means $a_s^{(L)} 1^T \in [0.04, 0.08)$, etc.

4.3 Second-Order Representation Learning

Besides classification and regression, we also consider the problem of unsupervised representation learning with a subsequent link prediction task. Three real-world datasets, Citeulike-a, Citeulike-t, and arXiv, are used. The first two datasets are from [22, 23], collected separately from CiteULike in different ways to mimic different real-world settings. The third one is from arXiv as one of the SNAP datasets [15]. Citeulike-a consists of 16,980 documents, 8,000 terms, and 44,709 links (citations). Citeulike-t consists of 25,975 documents, 20,000 terms, and 32,565 links. The last dataset, arXiv, consists of 27,770 documents, 8,000 terms, and 352,807 links.

Table 4: Link Rank on Three Datasets
Method | SAE | SDAE | VAE | gamma NPN | Gaussian NPN | Poisson NPN
Citeulike-a | 1104.7 | 992.4 | 980.8 | 851.7 (935.8) | 750.6 (823.9) | 690.9 (5389.7)
Citeulike-t | 2109.8 | 1356.8 | 1599.6 | 1342.3 (1400.7) | 1280.4 (1330.7) | 1354.1 (9117.2)
arXiv | 4232.7 | 2916.1 | 3367.2 | 2796.4 (3038.8) | 2687.9 (2923.8) | 2684.1 (10791.3)

The task is to perform unsupervised representation learning before feeding the extracted representations (middle-layer neurons) into a Bayesian LR algorithm [3]. We use the stacked autoencoder (SAE) [7], stacked denoising autoencoder (SDAE) [21], and variational autoencoder (VAE) [13] as baselines (hyperparameters like weight decay and dropout rate are chosen by cross validation).
As with SAE, we use different variants of NPN to form autoencoders where both the input and the output targets are bag-of-words (BOW) vectors for the documents. The network structure for all models is B-100-50 (B is the number of terms). Please refer to the supplementary material for detailed hyperparameters.
One major advantage of NPN over SAE and SDAE is that the learned representations are distributions instead of point estimates. Since representations from NPN contain both the mean and the variance, we call them second-order representations. Note that although VAE also produces second-order representations, its variance part is simply parameterized by multilayer perceptrons, whereas NPN's variance is naturally computed through the propagation of distributions. These 50-dimensional representations with both mean and variance are fed into a Bayesian LR algorithm for link prediction (for the deterministic AE the variance is set to 0).
We use links among 80% of the nodes (documents) to train the Bayesian LR and use the remaining links as the test set. Link rank and AUC (area under the ROC curve) are used as evaluation metrics. The link rank is the average rank of the observed links from test nodes to training nodes. We compute the AUC for every test node and report the average value. By definition, a lower link rank and a higher AUC indicate better predictive performance and imply more powerful representations.
Table 4 shows the link rank for different models. For a fair comparison we also try all baselines with double the budget (a structure of B-200-50) and report whichever performs better. As we can see, by treating representations as distributions rather than points in a vector space, NPN achieves a much lower link rank than all baselines, including VAE with variance information. The numbers in brackets show the link rank of NPN if we discard the variance information.
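The link rank metric defined above can be sketched as follows. This is an illustrative implementation under stated assumptions (the function name `link_rank`, the score matrix, and the toy data are hypothetical), not the paper's evaluation code:

```python
import numpy as np

def link_rank(scores, links):
    """Average rank of observed links.

    scores : (n_test, n_train) array of predicted link scores from each
             test node to every training node (higher = more likely).
    links  : list of (test_idx, train_idx) observed test->train links.
    """
    ranks = []
    for i, j in links:
        order = np.argsort(-scores[i])               # training nodes, best first
        rank = int(np.where(order == j)[0][0]) + 1   # 1-based rank of node j
        ranks.append(rank)
    return float(np.mean(ranks))

# toy example with 2 test nodes and 4 training nodes
scores = np.array([[0.9, 0.1, 0.5, 0.2],
                   [0.3, 0.8, 0.7, 0.1]])
links = [(0, 0), (1, 2)]  # node 0 -> train 0 (rank 1), node 1 -> train 2 (rank 2)
print(link_rank(scores, links))  # 1.5
```

A lower value means observed links are ranked closer to the top of each test node's candidate list, which is why lower link rank implies more powerful representations.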
The performance gain from variance information verifies the effectiveness of the variance (uncertainty) estimated by NPN. Among the different variants of NPN, the Gaussian NPN seems to perform better on datasets with fewer words, such as Citeulike-t (only 18.8 words per document). The Poisson NPN, as a more natural choice for modeling text, achieves the best performance on datasets with more words (Citeulike-a and arXiv). The AUC results are consistent with the link rank results (see Section 4 of the supplementary material). To further verify the effectiveness of the estimated uncertainty, we plot the reconstruction error and the variance o_s^{(L)} 1^T for each data point of Citeulike-a in Figure 3. As we can see, higher uncertainty often indicates not only a higher reconstruction error E but also a higher variance in E.

Figure 3: Reconstruction error and estimated uncertainty for each data point in Citeulike-a.

5 Conclusion

We have introduced a family of models, called natural-parameter networks, as a novel class of probabilistic NN that combines the merits of NN and PGM. NPN regards the weights and neurons as arbitrary exponential-family distributions rather than just point estimates or factorized Gaussian distributions. Such flexibility enables richer descriptions of hierarchical relationships among latent variables and adds another degree of freedom to customize NN for different types of data. Efficient sampling-free backpropagation-compatible algorithms are designed for the learning of NPN. Experiments show that NPN achieves state-of-the-art performance on classification, regression, and representation learning tasks. As possible extensions of NPN, it would be interesting to connect NPN to arbitrary PGM to form fully Bayesian deep learning models [24, 25], allowing even richer descriptions of relationships among latent variables.
It is also worth noting that NPN cannot be defined as a generative model and, unlike PGM, the same NPN model cannot be used to support multiple types of inference (with different observed and hidden variables). We will try to address these limitations in future work.

References

[1] A. K. Balan, V. Rathod, K. P. Murphy, and M. Welling. Bayesian dark knowledge. In NIPS, 2015.

[2] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. In ICML, 2015.

[5] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.

[6] C. E. Clark. The greatest of a finite set of random variables. Operations Research, 9(2):145-162, 1961.

[7] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2016.

[8] A. Graves. Practical variational inference for neural networks. In NIPS, 2011.

[9] R. Henao, Z. Gan, J. Lu, and L. Carin. Deep Poisson factor modeling. In NIPS, 2015.

[10] J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015.

[11] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In COLT, 1993.

[12] A. Karpathy and F. Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

[13] D. P.
Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.

[14] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.

[15] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[16] D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 1992.

[17] R. M. Neal. Learning stochastic feedforward networks. Technical report, Department of Computer Science, University of Toronto, 1990.

[18] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[19] R. Ranganath, L. Tang, L. Charlin, and D. M. Blei. Deep exponential families. In AISTATS, 2015.

[20] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969-978, 2009.

[21] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371-3408, 2010.

[22] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In KDD, 2011.

[23] H. Wang, B. Chen, and W.-J. Li. Collaborative topic regression with social regularization for tag recommendation. In IJCAI, 2013.

[24] H. Wang, N. Wang, and D. Yeung. Collaborative deep learning for recommender systems. In KDD, 2015.

[25] H. Wang and D. Yeung. Towards Bayesian deep learning: A framework and some existing methods. TKDE, 2016, to appear.

[26] M. Zhou, L. Hannah, D. B. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis.
In AISTATS, 2012.