{"title": "A Probabilistic Framework for Nonlinearities in Stochastic Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4486, "page_last": 4495, "abstract": "We present a probabilistic framework for nonlinearities, based on doubly truncated Gaussian distributions. By setting the truncation points appropriately, we are able to generate various types of nonlinearities within a unified framework, including sigmoid, tanh and ReLU, the most commonly used nonlinearities in neural networks. The framework readily integrates into existing stochastic neural networks (with hidden units characterized as random variables), allowing one for the first time to learn the nonlinearities alongside model weights in these networks. Extensive experiments demonstrate the performance improvements brought about by the proposed framework when integrated with the restricted Boltzmann machine (RBM), temporal RBM and the truncated Gaussian graphical model (TGGM).", "full_text": "A Probabilistic Framework for Nonlinearities in\n\nStochastic Neural Networks\n\nQinliang Su\n\nXuejun Liao\n\nLawrence Carin\n\nDepartment of Electrical and Computer Engineering\n\nDuke University, Durham, NC, USA\n\n{qs15, xjliao, lcarin}@duke.edu\n\nAbstract\n\nWe present a probabilistic framework for nonlinearities, based on doubly trun-\ncated Gaussian distributions. By setting the truncation points appropriately, we\nare able to generate various types of nonlinearities within a uni\ufb01ed framework,\nincluding sigmoid, tanh and ReLU, the most commonly used nonlinearities in\nneural networks. 
The framework readily integrates into existing stochastic neural networks (with hidden units characterized as random variables), allowing one for the first time to learn the nonlinearities alongside model weights in these networks. Extensive experiments demonstrate the performance improvements brought about by the proposed framework when integrated with the restricted Boltzmann machine (RBM), temporal RBM and the truncated Gaussian graphical model (TGGM).

1 Introduction

A typical neural network is composed of nonlinear units connected by linear weights, and such a network is known to have universal approximation ability under mild conditions on the nonlinearity used at each unit [1, 2]. In previous work, the choice of nonlinearity has commonly been taken as a part of network design rather than network learning, and the training algorithms for neural networks have been mostly concerned with learning the linear weights. However, it is becoming increasingly understood that the choice of nonlinearity plays an important role in model performance. For example, [3] showed advantages of rectified linear units (ReLU) over sigmoidal units in using the restricted Boltzmann machine (RBM) [4] to pre-train feedforward ReLU networks. It was further shown in [5] that rectified linear units (ReLU) outperform sigmoidal units in a generative network under the same undirected and bipartite structure as the RBM.

A number of recent works have reported benefits of learning nonlinear units along with the inter-unit weights. These methods are based on using parameterized nonlinear functions to activate each unit in a neural network, with the unit-dependent parameters incorporated into the data-driven training algorithms. In particular, [6] considered the adaptive piecewise linear (APL) unit defined by a mixture of hinge-shaped functions, and [7] used nonparametric Fourier basis expansion to construct the activation function of each unit.
The maxout network [8] employs piecewise linear convex (PLC) units, where each PLC unit is obtained by max-pooling over multiple linear units. The PLC units were extended to Lp units in [9], where the normalized Lp norm of multiple linear units yields the output of an Lp unit. All these methods have been developed for learning the deterministic characteristics of a unit, lacking a stochastic unit characterization. The deterministic nature prevents these methods from being easily applied to stochastic neural networks (for which the hidden units are random variables, rather than being characterized by a deterministic function), such as Boltzmann machines [10], restricted Boltzmann machines [11], and sigmoid belief networks (SBNs) [12].

We propose a probabilistic framework to unify the sigmoid, hyperbolic tangent (tanh) and ReLU nonlinearities most commonly used in neural networks. The proposed framework represents a unit h probabilistically as p(h|z, ξ), where z is the total net contribution that h receives from other units, and ξ represents the learnable parameters. By taking the expectation of h, a deterministic characterization of the unit is obtained as E(h|z, ξ) ≜ ∫ h p(h|z, ξ) dh. We show that the sigmoid, tanh and ReLU are well approximated by E(h|z, ξ) under appropriate settings of ξ. This is different from [13], in which nonlinearities were induced by additive noises of different variances, making model learning much more expensive and nonlinearity production less flexible. Additionally, more-general nonlinearities may be constituted or learned, with these corresponding to distinct settings of ξ. A neural unit represented by the proposed framework is named a truncated Gaussian (TruG) unit because the framework is built upon truncated Gaussian distributions.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Because of the inherent stochasticity, the TruG is particularly useful in constructing stochastic neural networks. The TruG generalizes the probabilistic ReLU in [14, 5] to a family of stochastic nonlinearities, with which one can perform two tasks that could not be done previously: (i) one can interchangeably use one nonlinearity in place of another under the same network structure, as long as they are both in the TruG family; for example, the ReLU-based stochastic networks in [14, 5] can be extended to new networks based on probabilistic tanh or sigmoid nonlinearities, and the respective algorithms in [14, 5] can be employed to train the associated new models with little modification; (ii) any stochastic network constructed with the TruG can learn the nonlinearity alongside the network weights, by maximizing the likelihood function of ξ given the training data. We can learn the nonlinearity at the unit level, with each TruG unit having its own parameters; or we can learn the nonlinearity at the model level, with the entire network sharing the same parameters for all its TruG units. The different choices entail only minor changes in the update equation of ξ, as will be seen subsequently.

We integrate the TruG framework into three existing stochastic networks: the RBM, temporal RBM [15] and feedforward TGGM [14], leading to three new models referred to as TruG-RBM, temporal TruG-RBM and TruG-TGGM, respectively. These new models are evaluated against the original models in extensive experiments to assess the performance gains brought about by the TruG.
To conserve space, all propositions in this paper are proven in the Supplementary Material.

2 TruG: A Probabilistic Framework for Nonlinearities in Neural Networks

For a unit h that receives net contribution z from other units, we propose to relate h to z through the following stochastic nonlinearity,

$$p(h|z,\xi) = \frac{\mathcal{N}(h|z,\sigma^2)\, I(\xi_1 \le h \le \xi_2)}{\int_{\xi_1}^{\xi_2} \mathcal{N}(h'|z,\sigma^2)\, dh'} \triangleq \mathcal{N}_{[\xi_1,\xi_2]}(h|z,\sigma^2), \qquad (1)$$

where I(·) is an indicator function and N(·|z, σ²) is the probability density function (PDF) of a univariate Gaussian distribution with mean z and variance σ²; the shorthand notation N_[ξ1,ξ2] indicates that the density N is truncated and renormalized such that it is nonzero only in the interval [ξ1, ξ2]; ξ ≜ {ξ1, ξ2} contains the truncation points and σ² is fixed.

The units of a stochastic neural network fall into two categories: visible units and hidden units [4]. The network represents a joint distribution over both hidden and visible units, and the hidden units are integrated out to yield the marginal distribution of visible units. With a hidden unit expressed as in (1), the expectation of h is given by

$$E(h|z,\xi) = z + \sigma\,\frac{\phi\big(\tfrac{\xi_1-z}{\sigma}\big) - \phi\big(\tfrac{\xi_2-z}{\sigma}\big)}{\Phi\big(\tfrac{\xi_2-z}{\sigma}\big) - \Phi\big(\tfrac{\xi_1-z}{\sigma}\big)}, \qquad (2)$$

where φ(·) and Φ(·) are, respectively, the PDF and cumulative distribution function (CDF) of the standard normal distribution [16]. As will become clear below, a weighted sum of these expected hidden units constitutes the net contribution received by each visible unit when the hidden units are marginalized out.
Therefore E(h|z, ξ) acts as a nonlinear activation function that maps the incoming contribution h receives to the outgoing contribution h sends out. The incoming contribution received by h may be a random variable or a function of data such as z = wᵀx + b; the former case is typical for unsupervised learning and the latter for supervised learning, with x being the predictors.

By setting the truncation points to different values, we are able to realize many different kinds of nonlinearities. We plot in Figure 1 three realizations of E(h|z, ξ) as a function of z, each with a particular setting of {ξ1, ξ2} and with σ² = 0.2 in all cases. The plots of ReLU, tanh and sigmoid are also shown for comparison.

Figure 1: Illustration of different nonlinearities realized by the TruG with different truncation points. (a) ξ1 = 0 and ξ2 = +∞; (b) ξ1 = −1 and ξ2 = 1; (c) ξ1 = 0 and ξ2 = 1; (d) ξ1 = 0 and ξ2 = 4.

It is seen from Figure 1 that, by choosing appropriate truncation points, E(h|z, ξ) is able to approximate ReLU, tanh and sigmoid, the three types of nonlinearities most widely used in neural networks. We can also realize other types of nonlinearities by setting the truncation points to other values, as exemplified in Figure 1(d). The truncation points can be set manually by hand, selected by cross-validation, or learned in the same way as the inter-unit weights. In this paper, we focus on learning them alongside the weights based on training data.

The variance of h, given by [16],

$$\mathrm{Var}(h|z,\xi) = \sigma^2 + \sigma^2\,\frac{\tfrac{\xi_1-z}{\sigma}\phi\big(\tfrac{\xi_1-z}{\sigma}\big) - \tfrac{\xi_2-z}{\sigma}\phi\big(\tfrac{\xi_2-z}{\sigma}\big)}{\Phi\big(\tfrac{\xi_2-z}{\sigma}\big) - \Phi\big(\tfrac{\xi_1-z}{\sigma}\big)} - \sigma^2\left(\frac{\phi\big(\tfrac{\xi_1-z}{\sigma}\big) - \phi\big(\tfrac{\xi_2-z}{\sigma}\big)}{\Phi\big(\tfrac{\xi_2-z}{\sigma}\big) - \Phi\big(\tfrac{\xi_1-z}{\sigma}\big)}\right)^2, \qquad (3)$$

is employed in learning the truncation points and network weights. Direct evaluation of (2) and (3) is prone to the numerical issue of 0/0, because both φ(z) and Φ(z) are so close to 0 when z < −38 that they are beyond the maximal accuracy a double-precision float can represent.
We solve this problem by using the fact that (2) and (3) can be equivalently expressed in terms of the ratio φ(z)/Φ(z), obtained by dividing both the numerator and the denominator by φ(·). We make use of the following approximation for the ratio,

$$\frac{\phi(z)}{\Phi(z)} \approx \frac{\sqrt{z^2+4} - z}{2} \triangleq \gamma(z), \quad \text{for } z < -38, \qquad (4)$$

the accuracy of which is established in Proposition 1.

Proposition 1. The relative error is bounded as $\big|\gamma(z)\big/\tfrac{\phi(z)}{\Phi(z)} - 1\big| < \tfrac{2(\sqrt{z^2+4}-z)}{\sqrt{z^2+8}-3z} - 1$; moreover, for all z < −38, the relative error is guaranteed to be smaller than 4.8 × 10⁻⁷, that is, $\big|\gamma(z)\big/\tfrac{\phi(z)}{\Phi(z)} - 1\big| < 4.8 \times 10^{-7}$ for all z < −38.

3 RBM with TruG Nonlinearity

We generalize the ReLU-based RBM in [5] by using the TruG nonlinearity. The resulting TruG-RBM is defined by the following joint distribution over visible units x and hidden units h,

$$p(x,h) = \frac{1}{Z}\, e^{-E(x,h)}\, I\big(x \in \{0,1\}^n,\ \xi_1 \le h \le \xi_2\big), \qquad (5)$$

where $E(x,h) \triangleq \frac{1}{2} h^T \mathrm{diag}(d)\, h - x^T W h - b^T x - c^T h$ is an energy function and Z is the normalization constant. Proposition 2 shows (5) is a valid probability distribution.

Proposition 2.
The distribution p(x, h) defined in (5) is normalizable.

By (5), the conditional distribution of x given h is still Bernoulli, p(x|h) = ∏_{i=1}^n σ([Wh + b]_i), while the conditional p(h|x) is a truncated normal distribution, i.e.,

$$p(h|x) = \prod_{j=1}^{m} \mathcal{N}_{[\xi_1,\xi_2]}\Big(h_j \,\Big|\, \tfrac{1}{d_j}[W^T x + c]_j,\ \tfrac{1}{d_j}\Big). \qquad (6)$$

By setting ξ1 and ξ2 to different values, we are able to produce different nonlinearities in (6). We train a TruG-RBM by maximizing the log-likelihood function ℓ(Θ, ξ) ≜ Σ_{x∈X} ln p(x; Θ, ξ), where Θ ≜ {W, b, c} denotes the network weights, p(x; Θ, ξ) ≜ ∫_{ξ1}^{ξ2} p(x, h) dh is the term contributed by a single data point x, and X is the training dataset.

3.1 The Gradient w.r.t. Network Weights

The gradient w.r.t. Θ is known to be $\frac{\partial \ln p(x)}{\partial \Theta} = E\big[\frac{\partial E(x,h)}{\partial \Theta}\big] - E\big[\frac{\partial E(x,h)}{\partial \Theta}\,\big|\,x\big]$, where E[·] and E[·|x] denote the expectations w.r.t. p(x, h) and p(h|x), respectively. If we estimate the gradient using a standard sampling-based method, the variance associated with the estimate is usually very large. To reduce the variance, we follow the traditional RBM in applying contrastive divergence (CD) to estimate the gradient [4].
Specifically, we approximate the gradient as

$$\frac{\partial \ln p(x)}{\partial \Theta} \approx E\bigg[\frac{\partial E(x,h)}{\partial \Theta}\,\bigg|\,x^{(k)}\bigg] - E\bigg[\frac{\partial E(x,h)}{\partial \Theta}\,\bigg|\,x\bigg], \qquad (7)$$

where x^(k) is the k-th sample of the Gibbs sampler p(h^(1)|x^(0)), p(x^(1)|h^(1)), …, p(x^(k)|h^(k)), with x^(0) being the data x. As shown in (6), p(x|h) and p(h|x) are factorized Bernoulli and univariate truncated normal distributions, for which efficient sampling algorithms exist [17, 18]. Furthermore, for the energy function in (5) we have $\frac{\partial E(x,h)}{\partial w_{ij}} = -x_i h_j$, $\frac{\partial E(x,h)}{\partial b_i} = -x_i$, $\frac{\partial E(x,h)}{\partial c_j} = -h_j$ and $\frac{\partial E(x,h)}{\partial d_j} = \frac{1}{2} h_j^2$. Thus estimation of the gradient with CD only requires E[h_j|x^(s)] and E[h_j²|x^(s)], which can be calculated using (2) and (3). Using the estimated gradient, the weights can be updated with the stochastic gradient ascent algorithm or its variants.

3.2 The Gradient w.r.t. Truncation Points

The gradients w.r.t. ξ1 and ξ2 are given by

$$\frac{\partial \ln p(x)}{\partial \xi_1} = \sum_{j=1}^{m} \big(p(h_j = \xi_1) - p(h_j = \xi_1|x)\big), \qquad (8)$$

$$\frac{\partial \ln p(x)}{\partial \xi_2} = \sum_{j=1}^{m} \big(p(h_j = \xi_2|x) - p(h_j = \xi_2)\big), \qquad (9)$$

for a single data point, with the derivation provided in the Supplementary Material. It is known that $p(h_j = \xi|x) = \mathcal{N}_{[\xi_1,\xi_2]}\big(h_j = \xi \,\big|\, \tfrac{1}{d_j}[W^T x + c]_j,\ \tfrac{1}{d_j}\big)$, which can be easily calculated. However, if we were to calculate p(h_j = ξ) directly, it would be computationally prohibitive.
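The Gibbs chain above only ever requires univariate truncated-normal draws for h|x. As a simple illustration (inverse-CDF sampling, not the more robust specialized samplers of [17, 18]; all parameter values below are hypothetical), one can draw such samples and check them against the closed-form mean (2):

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal: supplies pdf, cdf and inv_cdf

def sample_truncnorm(rng, mu, sigma, lo, hi):
    """One inverse-CDF draw from N_[lo,hi](mu, sigma^2)."""
    a = STD.cdf((lo - mu) / sigma)
    b = STD.cdf((hi - mu) / sigma)
    u = a + (b - a) * rng.random()      # uniform on (Phi(alpha), Phi(beta))
    return mu + sigma * STD.inv_cdf(u)

def truncnorm_mean(mu, sigma, lo, hi):
    """Closed-form mean, Eq. (2)."""
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    return mu + sigma * (STD.pdf(a) - STD.pdf(b)) / (STD.cdf(b) - STD.cdf(a))

# A hidden unit with hypothetical conditional mean 0.3 and variance 1,
# truncated to [xi1, xi2] = [0, 1].
rng = random.Random(0)
draws = [sample_truncnorm(rng, 0.3, 1.0, 0.0, 1.0) for _ in range(20000)]
mc_mean = sum(draws) / len(draws)
print(mc_mean, truncnorm_mean(0.3, 1.0, 0.0, 1.0))  # should agree closely
```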
Fortunately, by noticing the identity p(h_j = ξ) = Σ_x p(h_j = ξ|x) p(x), we are able to estimate it efficiently with CD as $p(h_j = \xi) \approx p(h_j = \xi|x^{(k)}) = \mathcal{N}_{[\xi_1,\xi_2]}\big(h_j = \xi \,\big|\, \tfrac{1}{d_j}[W^T x^{(k)} + c]_j,\ \tfrac{1}{d_j}\big)$, where x^(k) is the k-th sample of the Gibbs sampler described above. Therefore, the gradients w.r.t. the lower and upper truncation points can be estimated as $\frac{\partial \ln p(x)}{\partial \xi_2} \approx \sum_{j=1}^{m}\big(p(h_j = \xi_2|x) - p(h_j = \xi_2|x^{(k)})\big)$ and $\frac{\partial \ln p(x)}{\partial \xi_1} \approx -\sum_{j=1}^{m}\big(p(h_j = \xi_1|x) - p(h_j = \xi_1|x^{(k)})\big)$. After obtaining the gradients, we can update the truncation points with stochastic gradient ascent methods.

It should be emphasized that in the derivation above, we assume a common truncation point pair {ξ1, ξ2} shared among all units, for clarity of presentation. The extension to separate truncation points for different units is straightforward, obtained by simply replacing (8) and (9) with $\frac{\partial \ln p(x)}{\partial \xi_{2j}} = p(h_j = \xi_{2j}|x) - p(h_j = \xi_{2j})$ and $\frac{\partial \ln p(x)}{\partial \xi_{1j}} = p(h_j = \xi_{1j}) - p(h_j = \xi_{1j}|x)$, where ξ_{1j} and ξ_{2j} are the lower and upper truncation points of the j-th unit, respectively. For the models discussed subsequently, one can similarly obtain the gradients w.r.t. unit-dependent truncation points.

After training, due to the conditional independence between x and h and the existence of efficient sampling algorithms for the truncated normal, samples can be drawn efficiently from the TruG-RBM using the Gibbs sampler discussed below (7).

4 Temporal RBM with TruG Nonlinearity

We integrate the TruG framework into the temporal RBM (TRBM) [19] to learn the probabilistic nonlinearity in sequential-data modeling.
The resulting temporal TruG-RBM is defined by

$$p(X, H) = p(x_1, h_1) \prod_{t=2}^{T} p(x_t, h_t | x_{t-1}, h_{t-1}), \qquad (10)$$

where p(x_1, h_1) and p(x_t, h_t|x_{t−1}, h_{t−1}) are both represented by TruG-RBMs; x_t ∈ ℝⁿ and h_t ∈ ℝᵐ are the visible and hidden variables at time step t, with X ≜ [x_1, x_2, ⋯, x_T] and H ≜ [h_1, h_2, ⋯, h_T]. To be specific, the distribution p(x_t, h_t|x_{t−1}, h_{t−1}) is defined as $p(x_t, h_t|x_{t-1}, h_{t-1}) = \frac{1}{Z_t}\, e^{-E(x_t,h_t)}\, I\big(x_t \in \{0,1\}^n,\ \xi_1 \le h_t \le \xi_2\big)$, where the energy function takes the form

$$E(x_t, h_t) \triangleq \tfrac{1}{2}\Big( x_t^T \mathrm{diag}(a)\, x_t + h_t^T \mathrm{diag}(d)\, h_t - 2 x_t^T W_1 h_t - 2 c^T h_t - 2 (W_2 x_{t-1})^T h_t - 2 b^T x_t - 2 (W_3 x_{t-1})^T x_t - 2 (W_4 h_{t-1})^T h_t \Big),$$

and Z_t is the associated normalization constant.

Similar to the TRBM, directly optimizing the log-likelihood is difficult. We instead optimize the lower bound

$$L \triangleq E_{q(H|X)}\big[\ln p(X, H; \Theta, \xi) - \ln q(H|X)\big], \qquad (11)$$

where q(H|X) is an approximating posterior distribution of H. The lower bound equals the log-likelihood when q(H|X) is exactly the true posterior p(H|X). We follow [19] and choose the approximate posterior

$$q(H|X) = p(h_1|x_1) \cdots p(h_T|x_{T-1}, h_{T-1}, x_T),$$

with which it can be shown that the gradient of the lower bound w.r.t. the network weights is given by

$$\frac{\partial L}{\partial \Theta} = \sum_{t=1}^{T} E_{p(h_{t-1}|x_{t-2},h_{t-2},x_{t-1})}\bigg( E_{p(x_t,h_t|x_{t-1},h_{t-1})}\Big[\frac{\partial E(x_t,h_t)}{\partial \Theta}\Big] - E_{p(h_t|x_{t-1},h_{t-1},x_t)}\Big[\frac{\partial E(x_t,h_t)}{\partial \Theta}\Big] \bigg).$$
At any time step t, the outside expectation (which is over h_{t−1}) is approximated by sampling from p(h_{t−1}|x_{t−2}, h_{t−2}, x_{t−1}); given h_{t−1} and x_{t−1}, one can represent p(x_t, h_t|x_{t−1}, h_{t−1}) as a TruG-RBM, and therefore the two inside expectations can be computed in the same way as in Section 3. In particular, the variables in h_t are conditionally independent given (x_{t−1}, h_{t−1}, x_t), i.e., p(h_t|x_{t−1}, h_{t−1}, x_t) = ∏_{j=1}^m p(h_{jt}|x_{t−1}, h_{t−1}, x_t), with each component equal to

$$p(h_{jt}|x_{t-1}, h_{t-1}, x_t) = \mathcal{N}_{[\xi_1,\xi_2]}\Big(h_{jt} \,\Big|\, \tfrac{1}{d_j}\big[W_1^T x_t + W_2 x_{t-1} + W_4 h_{t-1} + c\big]_j,\ \tfrac{1}{d_j}\Big). \qquad (12)$$

Similarly, the variables in x_t are conditionally independent given (x_{t−1}, h_{t−1}, h_t). As a result, E_{p(h_t|x_{t−1},h_{t−1},x_t)}[·] can be calculated in closed form using (2) and (3), and E_{p(x_t,h_t|x_{t−1},h_{t−1})}[·] can be estimated using the CD algorithm, as in Section 3. The gradient of L w.r.t. the upper truncation point is

$$\frac{\partial L}{\partial \xi_2} = E_{q(H|X)}\bigg[ \sum_{t=1}^{T}\sum_{j=1}^{m} p(h_{jt} = \xi_2|x_{t-1}, h_{t-1}, x_t) - \sum_{t=1}^{T}\sum_{j=1}^{m} p(h_{jt} = \xi_2|x_{t-1}, h_{t-1}) \bigg],$$

with ∂L/∂ξ1 taking a similar form, where the expectations are calculated using the same approach as described above for ∂L/∂Θ.

5 TGGM with TruG Nonlinearity

We generalize the feedforward TGGM model in [14] by replacing its probabilistic ReLU with the TruG. The resulting TruG-TGGM model is defined by the joint PDF over visible variables y and hidden variables h,

$$p(y, h|x) = \mathcal{N}\big(y\,|\,W_1 h + b_1,\ \sigma^2 I\big)\, \mathcal{N}_{[\xi_1,\xi_2]}\big(h\,|\,W_0 x + b_0,\ \sigma^2 I\big), \qquad (13)$$

given the predictor variables x.
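Because the truncated-normal factor in (13) is independent across hidden units, marginalizing h reduces E[y|x] to a single deterministic forward pass through the TruG nonlinearity, as formalized in (14) below. A small Monte-Carlo sanity check of this marginalization (all weights and sizes here are hypothetical, chosen only for illustration):

```python
import random
from statistics import NormalDist

STD = NormalDist()
rng = random.Random(1)

# Hypothetical tiny TruG-TGGM: scalar x, two hidden units, scalar y.
W0, b0 = [0.8, -0.5], [0.1, 0.4]      # input-to-hidden parameters
W1, b1 = [1.2, -0.7], 0.2             # hidden-to-output parameters
xi1, xi2, sigma = 0.0, 1.0, 0.5       # sigmoid-like truncation, noise std

def trunc_mean(mu, s, lo, hi):        # closed-form mean, Eq. (2)
    a, b = (lo - mu) / s, (hi - mu) / s
    return mu + s * (STD.pdf(a) - STD.pdf(b)) / (STD.cdf(b) - STD.cdf(a))

def trunc_sample(mu, s, lo, hi):      # inverse-CDF draw from N_[lo,hi](mu, s^2)
    a, b = STD.cdf((lo - mu) / s), STD.cdf((hi - mu) / s)
    return mu + s * STD.inv_cdf(a + (b - a) * rng.random())

x = 0.7
mus = [w * x + c for w, c in zip(W0, b0)]  # element-wise W0 x + b0

# Monte-Carlo estimate of E[y|x] by sampling the joint in (13)
n = 20000
total = 0.0
for _ in range(n):
    h = [trunc_sample(m, sigma, xi1, xi2) for m in mus]
    y = sum(w * hj for w, hj in zip(W1, h)) + b1 + rng.gauss(0.0, sigma)
    total += y
mc = total / n

# Deterministic forward pass: W1 E(h|W0 x + b0, xi) + b1
exact = sum(w * trunc_mean(m, sigma, xi1, xi2) for w, m in zip(W1, mus)) + b1
print(mc, exact)  # the two agree to Monte-Carlo accuracy
```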
After marginalizing out h, we get the expectation of y as

$$E[y|x] = W_1\, E(h|W_0 x + b_0, \xi) + b_1, \qquad (14)$$

where E(h|W_0 x + b_0, ξ) is given element-wise by (2). It is then clear that the expectation of y is related to x through the TruG nonlinearity. Thus E[y|x] yields the same output as a three-layer perceptron that uses (2) to activate its hidden units. Hence, the TruG-TGGM model defined in (13) can be understood as a stochastic perceptron with the TruG nonlinearity. By choosing different values for the truncation points, we are able to realize different kinds of nonlinearities, including ReLU, sigmoid and tanh.

To train the model by maximum likelihood estimation, we need the gradient of ln p(y|x) ≜ ln ∫ p(y, h|x; Θ) dh, where Θ ≜ {W_1, W_0, b_1, b_0} represents the model parameters. By rewriting the joint PDF as p(y, h|x) ∝ e^{−E(y,h,x)} I(ξ1 ≤ h ≤ ξ2), the gradient is found to be given by

$$\frac{\partial \ln p(y|x)}{\partial \Theta} = E\Big[\frac{\partial E(y,h,x)}{\partial \Theta}\,\Big|\,x\Big] - E\Big[\frac{\partial E(y,h,x)}{\partial \Theta}\,\Big|\,x, y\Big],$$

where $E(y,h,x) \triangleq \frac{\|y - W_1 h - b_1\|^2 + \|h - W_0 x - b_0\|^2}{2\sigma^2}$; E[·|x] is the expectation w.r.t. p(y, h|x); and E[·|x, y] is the expectation w.r.t. p(h|x, y). From (13), we know p(h|x) = N_{[ξ1,ξ2]}(h|W_0 x + b_0, σ²I) can be factorized into a product of univariate truncated Gaussian PDFs. Thus the expectation E[h|x] can be computed using (2). However, the expectations E[h|x, y] and E[hhᵀ|x, y] involve a multivariate truncated Gaussian PDF and are expensive to calculate directly. Hence mean-field variational Bayesian analysis is used to compute the approximate expectations. The details are similar to those in [14], except that (2) and (3) are used to calculate the expectation and variance of h.

The gradients of the log-likelihood w.r.t.
the truncation points ξ1 and ξ2 are given by $\frac{\partial \ln p(y|x)}{\partial \xi_2} = \sum_{j=1}^{K}\big(p(h_j = \xi_2|y, x) - p(h_j = \xi_2|x)\big)$ and $\frac{\partial \ln p(y|x)}{\partial \xi_1} = -\sum_{j=1}^{K}\big(p(h_j = \xi_1|y, x) - p(h_j = \xi_1|x)\big)$ for a single data point, with the derivation provided in the Supplementary Material. The probability p(h_j = ξ1|x) can be computed directly since it involves only a univariate truncated Gaussian distribution. For p(h_j = ξ2|y, x), we approximate it with the mean-field marginal distributions obtained above. Although the TruG-TGGM involves random variables, thanks to the existence of a closed-form expression for the expectation of a univariate truncated normal, testing is still very easy: given a predictor x̂, the output can simply be predicted with the conditional expectation E[y|x] in (14).

6 Experimental Results

We evaluate the performance benefit brought about by the TruG framework when integrated into the RBM, temporal RBM and TGGM. In each of the three cases, the evaluation is based on comparing the original network to the associated new network with the TruG nonlinearity. For the TruG, we either manually set {ξ1, ξ2} to particular values, or learn them automatically from data. We consider both the case of learning a common {ξ1, ξ2} shared by all hidden units and the case of learning a separate {ξ1, ξ2} for each hidden unit.

Table 1: Averaged test log-probability on MNIST. (⋆) Results reported in [20]; (♦) Results reported in [21] using RMSprop as the optimizer.

Results of TruG-RBM  The binarized MNIST and Caltech101 Silhouettes datasets are considered in this experiment. MNIST contains 60,000 training and 10,000 testing images of hand-written digits, while Caltech101 Silhouettes includes 6364 training and 2307 testing images of objects' silhouettes.
For both datasets, each image has 28 × 28 pixels [22]. Throughout this experiment, 500 hidden units are used. RMSprop is used to update the parameters, with the decay and mini-batch size set to 0.95 and 100, respectively. The weight parameters are initialized with Gaussian noise of zero mean and 0.01 variance, while the lower and upper truncation points of all units are initialized to 0 and 1, respectively. The learning rate for the weight parameters is fixed to 10⁻⁴. Since truncation points influence the whole network in a more fundamental way than the weight parameters, it is observed that smaller learning rates are often preferred for them; to balance convergence speed and performance, we anneal their learning rate gradually from 10⁻⁴ to 10⁻⁶. The evaluation is based on the log-probability averaged over test data points, which is estimated using annealed importance sampling (AIS) [23] with 5 × 10⁵ inverse temperatures equally spaced in [0, 1]; the reported test log-probability is averaged over 100 independent AIS runs.

Model     | Trun. Points | MNIST  | Caltech101
TruG-RBM  | [0, 1]       | -97.3  | -127.9
TruG-RBM  | [0, +∞)      | -83.2  | -105.2
TruG-RBM  | [-1, 1]      | -124.5 | -141.5
TruG-RBM  | c-Learn      | -82.9  | -104.6
TruG-RBM  | s-Learn      | -82.5  | -104.3
RBM       | —            | -86.3⋆ | -109.0♦

Figure 2: (a) The learned nonlinearities in TruG models with shared upper truncation point ξ; the distribution of unit-level upper truncation points of TruG-RBM for (b) MNIST; (c) Caltech101 Silhouettes.

To investigate the impact of truncation points, we first set the lower and upper truncation points to three fixed pairs: [0, 1], [0, +∞) and [−1, 1], which correspond to probabilistic approximations of the sigmoid, ReLU and tanh nonlinearities, respectively. From Table 1, we see that the ReLU-type TruG-RBM performs much better than the other two types of TruG-RBM.
We also learn the truncation points from data automatically. We can see that the model benefits significantly from nonlinearity learning, and the best performance is achieved when the units learn their own nonlinearities. The learned common nonlinearities (c-Learn) for the different datasets are plotted in Figure 2(a), which shows that the model always tends to choose a nonlinearity in between the sigmoid and ReLU functions. For the case with separate nonlinearities (s-Learn), the distributions of the upper truncation points in the TruG-RBMs for MNIST and Caltech101 Silhouettes are plotted in Figures 2(b) and (c), respectively. Note that due to the detrimental effect observed for negative truncation points, here the lower truncation points are fixed to zero and only the upper points are learned. To demonstrate the reliability of the AIS estimates, convergence plots of the estimated log-probabilities are provided in the Supplementary Material.

Results of Temporal TruG-RBM  The Bouncing Ball and CMU Motion Capture datasets are considered in the experiments with temporal models. Bouncing Ball consists of synthetic binary videos of 3 bouncing balls in a box, with 4000 videos for training and 200 for testing; each video has 100 frames of size 30 × 30. CMU Motion Capture is composed of data samples describing the joint angles associated with different motion types. We follow [24] to train a model on 31 sequences and test the model on two testing sequences (one running and one walking). Both the original TRBM and the TruG-TRBM use 400 hidden units for Bouncing Ball and 300 hidden units for CMU Motion Capture. Stochastic gradient descent (SGD) is used to update the parameters, with the momentum set to 0.9. The learning rates are set to 10⁻² and 10⁻⁴ for the two datasets, respectively.
The learning rate for the truncation points is annealed gradually, as done above. Since calculating log-probabilities for these temporal models is computationally prohibitive, prediction error is employed as the performance evaluation criterion, as is widely done in temporal generative models [24, 25]. The performances averaged over 20 independent runs are reported here. Tables 2 and 3 confirm again that the models benefit remarkably from nonlinearity learning, especially in the case of learning a separate nonlinearity for each hidden unit. It is noticed that, although the ReLU-type TruG-TRBM performs better than the tanh-type TruG-TRBM on Bouncing Ball, the former performs much worse than the latter on CMU Motion Capture. This demonstrates that a fixed nonlinearity cannot perform well on every dataset. However, by learning the truncation points automatically, the TruG can adapt the nonlinearity to the data and thus performs the best on every dataset (up to the representational limit of the TruG framework). Video samples drawn from the trained models are provided in the Supplementary Material.

Results of TruG-TGGM  Ten datasets from the UCI repository are used in this experiment. Following the procedures in [26], the datasets are randomly partitioned into training and testing subsets for

Table 2: Test prediction error on Bouncing Ball. (⋆) Taken from [24], in which 2500 hidden units are used.

Table 3: Test prediction error on CMU Motion Capture, in which 'w' and 'r' mean walking and running, respectively.
(⋆) Taken from [24].

Model      | Trun. Points | Pred. Err.
TruG-TRBM  | [0, 1]       | 6.38±0.51
TruG-TRBM  | [0, +∞)      | 4.16±0.42
TruG-TRBM  | [-1, 1]      | 6.01±0.52
TruG-TRBM  | c-Learn      | 3.82±0.41
TruG-TRBM  | s-Learn      | 3.66±0.46
TRBM       | —            | 4.90±0.47
RTRBM⋆     | —            | 4.00±0.35

Model       | Trun. Points | Err. (w)  | Err. (r)
TruG-TRBM   | [0, 1]       | 8.2±0.18  | 6.1±0.22
TruG-TRBM   | [0, +∞)      | 21.8±0.31 | 14.9±0.29
TruG-TRBM   | [-1, 1]      | 7.3±0.21  | 5.9±0.22
TruG-TRBM   | c-Learn      | 6.7±0.29  | 5.5±0.22
TruG-TRBM   | s-Learn      | 6.8±0.24  | 5.4±0.14
TRBM        | —            | 9.6±0.15  | 6.8±0.12
ss-SRTRBM⋆  | —            | 8.1±0.06  | 5.9±0.05

Table 4: Averaged test RMSEs for the multilayer perceptron (MLP) and TruG-TGGMs under different truncation points. (⋆) Results reported in [26]. BH, CS, EE, K8, NP, CPP, PS, WQR, YH and YPM abbreviate Boston Housing, Concrete Strength, Energy Efficiency, Kin8nm, Naval Propulsion, Cycle Power Plant, Protein Structure, Wine Quality Red, Yacht Hydrodynamic and Year Prediction MSD, respectively.

TruG-TGGM with Different Trun.
Points\n\nDataset MLP (ReLU)(cid:63)\n3.228 \u00b10.195\n5.977\u00b10.093\n1.098\u00b10.074\n0.091\u00b10.002\n0.001\u00b10.000\n4.182\u00b10.040\n4.539\u00b10.029\n0.645\u00b10.010\n1.182\u00b10.165\n8.932\u00b1N/A\n\nBH\nCS\nEE\nK8\nNP\nCPP\nPS\nWQR\nYH\nYPM\n\n[0, 1]\n\n3.564\u00b10.655\n5.210\u00b10.514\n1.168\u00b10.130\n0.094\u00b10.003\n0.002\u00b10.000\n4.023\u00b10.128\n4.231\u00b10.083\n0.662\u00b10.052\n0.871\u00b10.367\n8.961\u00b1N/A\n\n[0, +\u221e)\n\n3.214\u00b10.555\n5.106\u00b10.573\n1.252\u00b10.123\n0.086\u00b10.003\n0.002\u00b10.000\n4.067\u00b10.129\n4.387\u00b10.072\n0.644\u00b10.048\n0.821\u00b10.276\n8.985\u00b1N/A\n\n[-1, 1]\n\n4.003\u00b10.520\n4.977\u00b10.482\n1.069\u00b10.166\n0.091\u00b10.003\n0.002\u00b1 0.000\n3.978\u00b10.132\n4.262\u00b10.079\n0.659\u00b10.052\n0.846\u00b10.310\n8.859\u00b1N/A\n\nc-Learn\n\n3.401\u00b10.375\n4.910\u00b10.467\n0.881\u00b10.079\n0.073\u00b10.002\n0.001\u00b10.000\n3.952\u00b10.134\n4.209\u00b10.073\n0.645\u00b10.050\n0.803\u00b10.292\n8.893\u00b1N/A\n\ns-Learn\n\n3.622\u00b1 0.538\n4.743\u00b1 0.571\n0.913\u00b1 0.120\n0.075\u00b1 0.002\n0.001\u00b1 0.000\n3.951\u00b1 0.130\n4.206\u00b1 0.071\n0.643\u00b1 0.048\n0.793\u00b1 0.289\n8.965\u00b1 N/A\n\n10 trials except the largest one (Year Prediction MSD), for which only one partition is conducted\ndue to computational complexity. Table 4 summarizes the root mean square error (RMSE) averaged\nover the different trials. Throughout the experiment, 100 hidden units are used for the two datasets\n(Protein Structure and Year Prediction MSD), while 50 units are used for the remaining. RMSprop is\nused to optimize the parameters, with RMSprop delay set to 0.9. The learning rate is chosen from the\nset {10\u22123, 2 \u00d7 104, 10\u22124}, while the mini-batch size is set to 100 for the two largest datasets and 50\nfor the others. 
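As a minimal sketch (not the authors' code), the evaluation protocol above, random 90/10 train/test partitions with the test RMSE averaged over trials, can be written as follows; `fit_predict` is a hypothetical placeholder for training a model (e.g., a TruG-TGGM) on the training split and predicting on the held-out inputs:

```python
import numpy as np

def average_rmse(X, y, fit_predict, n_trials=10, test_frac=0.1, seed=0):
    """Randomly partition the data n_trials times, compute the test RMSE of
    each trial, and return the mean and standard deviation over trials."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    rmses = []
    for _ in range(n_trials):
        idx = rng.permutation(len(X))          # fresh random partition per trial
        n_test = max(1, int(test_frac * len(X)))
        test, train = idx[:n_test], idx[n_test:]
        y_hat = fit_predict(X[train], y[train], X[test])  # model-specific step
        rmses.append(np.sqrt(np.mean((np.asarray(y_hat) - y[test]) ** 2)))
    return float(np.mean(rmses)), float(np.std(rmses))
```

Any regressor can be plugged in through `fit_predict`; a constant-mean baseline, for instance, is `lambda Xtr, ytr, Xte: np.full(len(Xte), ytr.mean())`.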
The number of VB cycles used in the inference is set to 10 for all datasets.
The RMSEs of TGGMs with fixed and learned truncation points are reported in Table 4, along with the RMSEs of the (deterministic) multilayer perceptron (MLP) using the ReLU nonlinearity for comparison. Similar to what we have observed in the generative models, the supervised models also benefit significantly from nonlinearity learning. The TruG-TGGMs with learned truncation points perform the best for most datasets, with separate learning performing slightly better than common learning overall. Due to limited space, the learned nonlinearities and their corresponding truncation points are provided in the Supplementary Material.

7 Conclusions
We have presented a probabilistic framework, termed TruG, to unify ReLU, sigmoid and tanh, the most commonly used nonlinearities in neural networks. The TruG is a family of nonlinearities constructed with doubly truncated Gaussian distributions. The ReLU, sigmoid and tanh are three important members of the TruG family, and other members can be obtained easily by adjusting the lower and upper truncation points. A big advantage offered by the TruG is that the nonlinearity is learnable from data, alongside the model weights. Due to its stochastic nature, the TruG can be readily integrated into many stochastic neural networks in which the hidden units are random variables. Extensive experiments have demonstrated the significant performance gains that the TruG framework can bring about when it is integrated with the RBM, temporal RBM, or TGGM.

Acknowledgements
The research reported here was supported by the DOE, NGA, NSF, ONR and by Accenture.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[2] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[3] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[4] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[5] Qinliang Su, Xuejun Liao, Chunyuan Li, Zhe Gan, and Lawrence Carin. Unsupervised learning with truncated gaussian graphical models. In The Thirty-First National Conference on Artificial Intelligence (AAAI), 2016.

[6] Forest Agostinelli, Matthew D. Hoffman, Peter J. Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. CoRR, 2014.

[7] Carson Eisenach, Han Liu, and Zhaoran Wang. Nonparametrically learning activation functions in deep neural nets. In Under review as a conference paper at ICLR, 2017.

[8] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In International Conference on Machine Learning (ICML), 2013.

[9] Caglar Gulcehre, Kyunghyun Cho, Razvan Pascanu, and Yoshua Bengio. Learned-norm pooling for deep feedforward and recurrent neural networks. In Machine Learning and Knowledge Discovery in Databases, pages 530–546, 2014.

[10] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive Science, 9(1):147–169, 1985.

[11] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[12] Radford M Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.

[13] Brendan J Frey. Continuous sigmoidal belief networks trained using slice sampling.
In Advances in Neural Information Processing Systems, pages 452–458, 1997.

[14] Qinliang Su, Xuejun Liao, Changyou Chen, and Lawrence Carin. Nonlinear statistical learning with truncated gaussian graphical models. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016.

[15] Ilya Sutskever, Geoffrey E Hinton, and Graham W. Taylor. The recurrent temporal restricted boltzmann machine. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1601–1608. Curran Associates, Inc., 2009.

[16] Norman L Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Continuous Univariate Distributions, vol. 1-2, 1994.

[17] Nicolas Chopin. Fast simulation of truncated gaussian distributions. Statistics and Computing, 21(2):275–288, 2011.

[18] Christian P Robert. Simulation of truncated normal variables. Statistics and Computing, 5(2):121–125, 1995.

[19] Ilya Sutskever and Geoffrey E Hinton. Learning multilevel distributed representations for high-dimensional sequences. In AISTATS, volume 2, pages 548–555, 2007.

[20] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879. ACM, 2008.

[21] David E Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, and Volkan Cevher. Preconditioned spectral descent for deep learning. In Advances in Neural Information Processing Systems, pages 2971–2979, 2015.

[22] Benjamin M Marlin, Kevin Swersky, Bo Chen, and Nando D Freitas. Inductive principles for restricted boltzmann machine learning. In International Conference on Artificial Intelligence and Statistics, pages 509–516, 2010.

[23] Radford M Neal. Annealed importance sampling.
Statistics and Computing, 11(2):125–139, 2001.

[24] Roni Mittelman, Benjamin Kuipers, Silvio Savarese, and Honglak Lee. Structured recurrent temporal restricted boltzmann machines. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1647–1655, 2014.

[25] Zhe Gan, Chunyuan Li, Ricardo Henao, David E Carlson, and Lawrence Carin. Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, pages 2467–2475, 2015.

[26] José Miguel Hernández-Lobato and Ryan P Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[27] Siamak Ravanbakhsh, Barnabás Póczos, Jeff Schneider, Dale Schuurmans, and Russell Greiner. Stochastic neural networks with monotonic activation functions. AISTATS, 1050:14, 2016.

[28] Max Welling, Michal Rosen-Zvi, and Geoffrey E Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, pages 1481–1488, 2004.

[29] Qinliang Su and Yik-Chung Wu. On convergence conditions of gaussian belief propagation. IEEE Transactions on Signal Processing, 63(5):1144–1155, 2015.

[30] Qinliang Su and Yik-Chung Wu. Convergence analysis of the variance in gaussian belief propagation. IEEE Transactions on Signal Processing, 62(19):5119–5131, 2014.

[31] Brendan J Frey and Geoffrey E Hinton. Variational learning in nonlinear gaussian belief networks. Neural Computation, 11(1):193–213, 1999.

[32] Qinliang Su and Yik-Chung Wu. Distributed estimation of variance in gaussian graphical model via belief propagation: Accuracy analysis and improvement. IEEE Transactions on Signal Processing, 63(23):6258–6271, 2015.

[33] Daniel Soudry, Itay Hubara, and Ron Meir.
Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems 27, pages 963–971. Curran Associates, Inc., 2014.

[34] Soumya Ghosh, Francesco Maria Delle Fave, and Jonathan Yedidia. Assumed density filtering methods for learning bayesian neural networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 1589–1595, 2016.