{"title": "Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2438, "page_last": 2446, "abstract": "We consider the estimation of an i.i.d.\\ vector $\\xbf \\in \\R^n$ from measurements $\\ybf \\in \\R^m$ obtained by a general cascade model consisting of a known linear transform followed by a probabilistic componentwise (possibly nonlinear) measurement channel. We present a method, called adaptive generalized approximate message passing (Adaptive GAMP), that enables joint learning of the statistics of the prior and measurement channel along with estimation of the unknown vector $\\xbf$. The proposed algorithm is a generalization of a recently-developed method by Vila and Schniter that uses expectation-maximization (EM) iterations where the posteriors in the E-steps are computed via approximate message passing. The techniques can be applied to a large class of learning problems including the learning of sparse priors in compressed sensing or identification of linear-nonlinear cascade models in dynamical systems and neural spiking processes. We prove that for large i.i.d.\\ Gaussian transform matrices the asymptotic componentwise behavior of the adaptive GAMP algorithm is predicted by a simple set of scalar state evolution equations. This analysis shows that the adaptive GAMP method can yield asymptotically consistent parameter estimates, which implies that the algorithm achieves a reconstruction quality equivalent to the oracle algorithm that knows the correct parameter values. The adaptive GAMP methodology thus provides a systematic, general and computationally efficient method applicable to a large range of complex linear-nonlinear models with provable guarantees.", "full_text": "Approximate Message Passing with Consistent\n\nParameter Estimation and Applications to Sparse\n\nLearning\n\nUlugbek S. 
Kamilov (EPFL, ulugbek.kamilov@epfl.ch)

Sundeep Rangan (Polytechnic Institute of New York University, srangan@poly.edu)

Alyson K. Fletcher (University of California, Santa Cruz, afletcher@soe.ucsc.edu)

Michael Unser (EPFL, michael.unser@epfl.ch)

Abstract

We consider the estimation of an i.i.d. vector x ∈ R^n from measurements y ∈ R^m obtained by a general cascade model consisting of a known linear transform followed by a probabilistic componentwise (possibly nonlinear) measurement channel. We present a method, called adaptive generalized approximate message passing (adaptive GAMP), that enables joint learning of the statistics of the prior and measurement channel along with estimation of the unknown vector x. Our method can be applied to a large class of learning problems, including the learning of sparse priors in compressed sensing or the identification of linear-nonlinear cascade models in dynamical systems and neural spiking processes. We prove that for large i.i.d. Gaussian transform matrices the asymptotic componentwise behavior of the adaptive GAMP algorithm is predicted by a simple set of scalar state evolution equations. This analysis shows that the adaptive GAMP method can yield asymptotically consistent parameter estimates, which implies that the algorithm achieves a reconstruction quality equivalent to the oracle algorithm that knows the correct parameter values. The adaptive GAMP methodology thus provides a systematic, general and computationally efficient method applicable to a large range of complex linear-nonlinear models with provable guarantees.

1 Introduction

Consider the estimation of a random vector x ∈ R^n from a measurement vector y ∈ R^m. As illustrated in Figure 1, the vector x, which is assumed to have i.i.d. components x_j ~ P_X, is passed through a known linear transform that outputs z = Ax ∈ R^m.
The components of y ∈ R^m are generated by a componentwise transfer function P_{Y|Z}. This paper addresses the cases where the distributions P_X and P_{Y|Z} have some parametric uncertainty that must be learned in order to properly estimate x.

Figure 1: Measurement model considered in this work. The vector x ∈ R^n with an i.i.d. prior P_X(x|λ_x) passes through the linear transform A ∈ R^{m×n} followed by a componentwise nonlinear channel P_{Y|Z}(y|z, λ_z) to result in y ∈ R^m. The prior P_X and the nonlinear channel P_{Y|Z} depend on the unknown parameters λ_x and λ_z, respectively. We propose adaptive GAMP to jointly estimate x and (λ_x, λ_z) given the measurements y.

This joint estimation and learning problem with linear transforms and componentwise nonlinearities arises in a range of applications, including empirical Bayesian approaches to inverse problems in signal processing, linear regression and classification [1, 2], and, more recently, Bayesian compressed sensing for the estimation of sparse vectors x from underdetermined measurements [3–5]. Also, since the parameters in the output transfer function P_{Y|Z} can model unknown nonlinearities, this problem formulation can be applied to the identification of linear-nonlinear cascade models of dynamical systems, in particular for neural spike responses [6–8].

In recent years, there has been considerable interest in so-called approximate message passing (AMP) methods for this estimation problem. The AMP techniques use Gaussian and quadratic approximations of loopy belief propagation (LBP) to provide estimation methods that are computationally efficient, general and analytically tractable. However, the AMP methods generally require that the distributions P_X and P_{Y|Z} are known perfectly.
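The cascade model just described can be simulated in a few lines. The sketch below is illustrative only: the sizes, the Bernoulli-Gaussian prior parameters, and the sign-after-noise output channel are hypothetical stand-ins for P_X(·|λ_x) and P_{Y|Z}(·|·, λ_z), chosen to emphasize that the channel may be nonlinear.

```python
import numpy as np

# Sketch of the cascade model of Figure 1 (all sizes/parameters illustrative):
# i.i.d. Bernoulli-Gaussian input, Gaussian mixing matrix A with N(0, 1/m)
# entries, and a componentwise output channel P(y|z) -- here AWGN followed by
# a hard sign nonlinearity.
rng = np.random.default_rng(0)
n, m = 400, 300
rho, var_x = 0.2, 5.0     # sparsity ratio and active-component variance (lambda_x)
sigma2 = 0.1              # output-channel noise variance (lambda_z)

x = rng.binomial(1, rho, n) * rng.normal(0.0, np.sqrt(var_x), n)  # x_j ~ P_X
A = rng.normal(0.0, np.sqrt(1.0 / m), (m, n))                     # linear transform
z = A @ x                                                         # z = Ax
y = np.sign(z + rng.normal(0.0, np.sqrt(sigma2), m))              # componentwise channel
```

The joint estimation problem is then: recover x and the unknown (rho, var_x, sigma2) given only y, A, and the parametric form of the model.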
When the parameters λ_x and λ_z are unknown, various extensions have been proposed, including combining AMP methods with expectation-maximization (EM) estimation [9–12] and hybrid graphical-model approaches [13]. In this work, we present a novel method for joint parameter and vector estimation, called adaptive generalized AMP (adaptive GAMP), that extends the GAMP method of [14]. We present two major theoretical results related to adaptive GAMP. We first show that, similar to the analysis of the standard GAMP algorithm, the componentwise asymptotic behavior of adaptive GAMP can be exactly described by a simple set of scalar state evolution (SE) equations [14–18]. An important consequence of this result is a theoretical justification of the EM-GAMP algorithm in [9–12], which is a special case of adaptive GAMP with a particular choice of adaptation functions. Our second result demonstrates the asymptotic consistency of adaptive GAMP when the adaptation functions correspond to maximum-likelihood (ML) parameter estimation. We show that when the ML estimation is computed exactly, the estimated parameters converge to the true values, and the performance of adaptive GAMP asymptotically coincides with the performance of the oracle GAMP algorithm that knows the correct parameter values. Adaptive GAMP thus provides a computationally efficient method for solving a wide variety of joint estimation and learning problems with a simple, exact performance characterization and provable conditions for asymptotic consistency.

All proofs and some technical details that have been omitted for space appear in the full paper [19], which also provides more background and simulations.

2 Adaptive GAMP

Approximate message passing (AMP) refers to a class of algorithms based on Gaussian approximations of loopy belief propagation (LBP) for the estimation of the vectors x and z according to the model described in Section 1.
These methods originated from CDMA multiuser detection problems in [15, 20, 21]; more recently, they have attracted considerable attention in compressed sensing [17, 18, 22]. The Gaussian approximations used in AMP are closely related to standard expectation propagation techniques [23, 24], but with additional simplifications that exploit the linear coupling between the variables x and z. The key benefits of AMP methods are their computational performance, their large domain of application, and, for certain large random A, their exact asymptotic performance characterizations with testable conditions for optimality [15–18]. This paper considers an adaptive version of the so-called generalized AMP (GAMP) method of [14] that extends the algorithm in [22] to arbitrary output distributions P_{Y|Z}.

The original GAMP algorithm of [14] requires that the distributions P_X and P_{Y|Z} are known. We propose an adaptive GAMP, shown in Algorithm 1, to allow for simultaneous estimation of the distributions P_X and P_{Y|Z} along with the estimation of x and z. The algorithm assumes that the distributions P_X and P_{Y|Z} have some parametric forms

    P_X(x|λ_x),  P_{Y|Z}(y|z, λ_z),    (1)

for parameters λ_x ∈ Λ_x and λ_z ∈ Λ_z and for parameter sets Λ_x and Λ_z. Algorithm 1 produces a sequence of estimates x̂^t and ẑ^t for x and z along with parameter estimates λ̂_x^t and λ̂_z^t. The precise value of these estimates depends on several factors in the algorithm, including the termination criteria and the choice of what we will call estimation functions G_x^t, G_z^t and G_s^t, and adaptation functions H_x^t and H_z^t.

Algorithm 1 Adaptive GAMP
Require: Matrix A, estimation functions G_x^t, G_z^t and G_s^t, and adaptation functions H_x^t and H_z^t.
 1: Initialize t ← 0, s^{-1} ← 0 and some values for x̂^0, τ_x^0
 2: repeat
 3:   {Output node update}
 4:   τ_p^t ← ‖A‖_F^2 τ_x^t / m
 5:   p^t ← A x̂^t − s^{t−1} τ_p^t
 6:   λ̂_z^t ← H_z^t(p^t, y, τ_p^t)
 7:   ẑ_i^t ← G_z^t(p_i^t, y_i, τ_p^t, λ̂_z^t) for all i = 1, …, m
 8:   s_i^t ← G_s^t(p_i^t, y_i, τ_p^t, λ̂_z^t) for all i = 1, …, m
 9:   τ_s^t ← −(1/m) Σ_i ∂G_s^t(p_i^t, y_i, τ_p^t, λ̂_z^t)/∂p_i
10:   {Input node update}
11:   1/τ_r^t ← ‖A‖_F^2 τ_s^t / n
12:   r^t ← x̂^t + τ_r^t A^T s^t
13:   λ̂_x^t ← H_x^t(r^t, τ_r^t)
14:   x̂_j^{t+1} ← G_x^t(r_j^t, τ_r^t, λ̂_x^t) for all j = 1, …, n
15:   τ_x^{t+1} ← (τ_r^t/n) Σ_j ∂G_x^t(r_j^t, τ_r^t, λ̂_x^t)/∂r_j
16: until Terminated

The choice of the estimation and adaptation functions allows for considerable flexibility in the algorithm. For example, it is shown in [14] that G_x^t, G_z^t, and G_s^t can be selected such that the GAMP algorithm implements Gaussian approximations of either max-sum LBP or sum-product LBP, which approximate the maximum-a-posteriori (MAP) or minimum-mean-squared-error (MMSE) estimates of x given y, respectively.
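For concreteness, the loop structure of Algorithm 1 can be sketched in code for the simplest closed-form instance: a Gaussian prior x_j ~ N(0, v0) and an AWGN output channel y = z + N(0, s2), with adaptation functions that simply return the fixed true parameters (the non-adaptive special case discussed in the text). All sizes and parameter values below are illustrative.

```python
import numpy as np

# Minimal sketch of the GAMP loop of Algorithm 1 (non-adaptive special case):
# Gaussian prior and AWGN output, so G_x, G_z, G_s all have closed forms.
rng = np.random.default_rng(1)
n, m = 200, 300
v0, s2 = 1.0, 0.05                      # "true" lambda_x, lambda_z (held fixed here)

x = rng.normal(0.0, np.sqrt(v0), n)
A = rng.normal(0.0, np.sqrt(1.0 / m), (m, n))
y = A @ x + rng.normal(0.0, np.sqrt(s2), m)

normA2 = np.sum(A ** 2)                 # ||A||_F^2
xhat, tau_x = np.zeros(n), v0
s = np.zeros(m)
for t in range(30):
    # -- output node update --
    tau_p = normA2 * tau_x / m
    p = A @ xhat - s * tau_p
    zhat = p + tau_p / (tau_p + s2) * (y - p)   # G_z: E[Z | P=p, Y=y]
    s = (y - p) / (tau_p + s2)                  # G_s = (G_z - p) / tau_p
    tau_s = 1.0 / (tau_p + s2)                  # -(1/m) * sum_i dG_s/dp_i
    # -- input node update --
    tau_r = n / (normA2 * tau_s)
    r = xhat + tau_r * (A.T @ s)
    xhat = v0 / (v0 + tau_r) * r                # G_x: E[X | R=r] for Gaussian prior
    tau_x = tau_r * v0 / (v0 + tau_r)           # (tau_r/n) * sum_j dG_x/dr_j

mse = np.mean((xhat - x) ** 2)
```

With these closed-form scalar functions the loop reduces to a sequence of matrix-vector products and componentwise operations, which is the computational appeal of the method.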
The adaptation functions can also be selected for a number of different parameter-estimation strategies. Because of space limitations, we present only the estimation functions for the sum-product GAMP algorithm from [14] along with an ML-type adaptation. Some of the analysis below, however, applies more generally.

As described in [14], the sum-product estimation can be implemented with the functions

    G_x^t(r, τ_r, λ̂_x) := E[X | R = r, τ_r, λ̂_x],    (2a)
    G_z^t(p, y, τ_p, λ̂_z) := E[Z | P = p, Y = y, τ_p, λ̂_z],    (2b)
    G_s^t(p, y, τ_p, λ̂_z) := (1/τ_p) ( G_z^t(p, y, τ_p, λ̂_z) − p ),    (2c)

where the expectations are with respect to the scalar random variables

    R = X + V_x,  V_x ~ N(0, τ_r),  X ~ P_X(·|λ̂_x),    (3a)
    Z = P + V_z,  V_z ~ N(0, τ_p),  Y ~ P_{Y|Z}(·|Z, λ̂_z).    (3b)

The estimation functions (2) correspond to scalar estimates of random variables in additive white Gaussian noise (AWGN). A key result of [14] is that, when the parameters are set to the true values (i.e., (λ̂_x, λ̂_z) = (λ_x, λ_z)), the outputs x̂^t and ẑ^t can be interpreted as sum-product estimates of the conditional expectations E(x|y) and E(z|y). The algorithm thus reduces the vector-valued estimation problem to a computationally simple sequence of scalar AWGN estimation problems along with linear transforms.

The adaptation functions H_x^t and H_z^t in Algorithm 1 produce the estimates for the parameters λ_x and λ_z.
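As a concrete instance of the sum-product input function (2a), the scalar estimate E[X | R = r] has a closed form when P_X is Bernoulli-Gaussian; the sketch below implements that one special case (the parameter values are illustrative, not prescribed by the text).

```python
import numpy as np

# G_x of (2a) for a Bernoulli-Gaussian prior: X ~ rho*N(0, v) + (1-rho)*delta_0
# observed as R = X + N(0, tau_r).  E[X | R = r] then shrinks r toward zero.
def gauss(r, var):
    return np.exp(-r ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def G_x(r, tau_r, rho, v):
    """Posterior mean E[X | R = r] in AWGN for the Bernoulli-Gaussian prior."""
    num = rho * gauss(r, v + tau_r)
    den = num + (1.0 - rho) * gauss(r, tau_r)
    pi = num / den                       # posterior probability that X is active
    return pi * (v / (v + tau_r)) * r    # active-branch linear MMSE, gated by pi

r = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
xhat = G_x(r, tau_r=0.1, rho=0.2, v=5.0)
```

Small observations r are shrunk nearly to zero (the spike dominates), while large r pass through with only the v/(v + tau_r) attenuation, which is the qualitative behavior of sparsity-aware scalar denoising.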
In the special case when H_x^t and H_z^t produce fixed outputs

    H_x^t(r^t, τ_r^t) = λ̄_x^t,  H_z^t(p^t, y^t, τ_p^t) = λ̄_z^t,

for pre-computed values of λ̄_x^t and λ̄_z^t, the adaptive GAMP algorithm reduces to the standard (non-adaptive) GAMP algorithm of [14]. The non-adaptive GAMP algorithm can be used when the parameters λ_x and λ_z are known.

When the parameters λ_x and λ_z are unknown, it has been proposed in [9–12] that they can be estimated via an EM method that exploits the fact that GAMP provides estimates of the posterior distributions of x and z given the current parameter estimates. As described in the full paper [19], this EM-GAMP method corresponds to a special case of the adaptive GAMP method for a particular choice of the adaptation functions H_x^t and H_z^t.

However, in this work, we consider an alternate parameter estimation method based on ML adaptation. The ML adaptation uses the following fact that we will rigorously justify below: for certain large random A, at any iteration t, the components of the vectors r^t and the joint vectors (p^t, y^t) will be distributed as

    R = α_r X + V_x,  V_x ~ N(0, ξ_r),  X ~ P_X(·|λ*_x),    (4a)
    (Z, P) ~ N(0, K_p),  Y ~ P_{Y|Z}(·|Z, λ*_z),    (4b)

where λ*_x and λ*_z are the "true" parameters and the scalars α_r and ξ_r and the covariance matrix K_p are some parameters that depend on the estimation and adaptation functions used in the previous iterations. Remarkably, the distributions of the components of r^t and (p^t, y^t) will follow (4) even if the estimation functions in the iterations prior to t used the incorrect parameter values.
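As a toy illustration of parameter learning under the scalar model (4a), the sketch below recovers the sparsity level of a Bernoulli-Gaussian prior by maximizing the empirical log-likelihood over a grid. This is only a stand-in for the H_x adaptation: for simplicity the scale α_r is fixed to 1 and the noise level ξ is assumed known, and all numeric values are illustrative.

```python
import numpy as np

# Grid-search ML fit of the sparsity rho from samples r_j = x_j + v_j,
# v_j ~ N(0, xi), x_j ~ rho*N(0, v) + (1-rho)*delta_0  (alpha_r = 1, xi known).
rng = np.random.default_rng(2)
n, rho_true, v, xi = 20000, 0.2, 5.0, 0.1

x = rng.binomial(1, rho_true, n) * rng.normal(0.0, np.sqrt(v), n)
r = x + rng.normal(0.0, np.sqrt(xi), n)

def avg_loglik(r, rho):
    # Mixture density p_R(r | rho) = rho*N(r; 0, v + xi) + (1 - rho)*N(r; 0, xi)
    g = lambda var: np.exp(-r ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return np.mean(np.log(rho * g(v + xi) + (1.0 - rho) * g(xi)))

grid = np.linspace(0.05, 0.5, 46)
rho_hat = grid[np.argmax([avg_loglik(r, q) for q in grid])]
```

With many scalar samples the empirical average concentrates around its expectation, which is the mechanism that makes this kind of maximization consistent in the analysis that follows.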
The adaptive GAMP algorithm can thus attempt to estimate the parameters via a maximum-likelihood (ML) estimation:

    H_x^t(r^t, τ_r^t) := arg max_{λ_x ∈ Λ_x} max_{(α_r, ξ_r) ∈ S_x(τ_r^t)} (1/n) Σ_{j=0}^{n−1} φ_x(r_j^t, λ_x, α_r, ξ_r),    (5a)
    H_z^t(p^t, y, τ_p^t) := arg max_{λ_z ∈ Λ_z} max_{K_p ∈ S_z(τ_p^t)} (1/m) Σ_{i=0}^{m−1} φ_z(p_i^t, y_i, λ_z, K_p),    (5b)

where S_x and S_z are sets of possible values for the parameters α_r, ξ_r and K_p, and φ_x and φ_z are the log-likelihoods

    φ_x(r, λ_x, α_r, ξ_r) = log p_R(r | λ_x, α_r, ξ_r),    (6a)
    φ_z(p, y, λ_z, K_p) = log p_{P,Y}(p, y | λ_z, K_p),    (6b)

where p_R and p_{P,Y} are the probability density functions corresponding to the distributions in (4).

3 Convergence and Asymptotic Consistency with Gaussian Transforms

3.1 General State Evolution Analysis

Before proving the asymptotic consistency of the adaptive GAMP method with ML adaptation, we first prove a more general convergence result. Among other consequences, the result will justify the distribution model (4) assumed by the ML adaptation. Similar to the SE analyses in [14, 18], we consider the asymptotic behavior of the adaptive GAMP algorithm with large i.i.d. Gaussian matrices. The assumptions are summarized as follows. Details can be found in the full paper [19, Assumption 2].

Assumption 1 Consider the adaptive GAMP algorithm running on a sequence of problems indexed by the dimension n, satisfying the following:

(a) For each n, the matrix A ∈ R^{m×n} has i.i.d.
components with A_ij ~ N(0, 1/m), and the dimension m = m(n) is a deterministic function of n satisfying n/m → β for some β > 0 as n → ∞.

(b) The input vector x and initial condition x̂^0 are deterministic sequences whose components converge empirically with bounded moments of order s = 2k − 2 as

    lim_{n→∞} (x, x̂^0) =_{PL(s)} (X, X̂^0),    (7)

to some random vector (X, X̂^0) for k = 2. See [19] for a precise statement of this type of convergence.

(c) The output vectors z and y ∈ R^m are generated by

    z = Ax,  y = h(z, w),    (8)

for some scalar function h(z, w), where the disturbance vector w is deterministic but empirically converges as

    lim_{n→∞} w =_{PL(s)} W,    (9)

with s = 2k − 2, k = 2, and W is some random variable. We let P_{Y|Z} denote the conditional distribution of the random variable Y = h(Z, W).

(d) Suitable continuity assumptions on the estimation functions G_x^t, G_z^t and G_s^t and adaptation functions H_x^t and H_z^t – see [19] for details.

Now define the sets of vectors

    θ_x^t := {(x_j, r_j^t, x̂_j^{t+1}), j = 1, …, n},  θ_z^t := {(z_i, ẑ_i^t, y_i, p_i^t), i = 1, …, m}.    (10)

The first vector set, θ_x^t, represents the components of the "true," but unknown, input vector x, its adaptive GAMP estimate x̂^t, as well as r^t. The second vector set, θ_z^t, contains the components of the "true," but unknown, output vector z, its GAMP estimate ẑ^t, as well as p^t and the observed vector y. The sets θ_x^t and θ_z^t are implicitly functions of the dimension n.
Our main result, Theorem 1 below, characterizes the asymptotic joint distribution of the components of these two sets as n → ∞. Specifically, we will show that the empirical distributions of the components of θ_x^t and θ_z^t converge to random vectors of the form

    θ̄_x^t := (X, R^t, X̂^{t+1}),  θ̄_z^t := (Z, Ẑ^t, Y, P^t),    (11)

where X is the random variable in the initial condition (7). R^t and X̂^{t+1} are given by

    R^t = α_r^t X + V^t,  V^t ~ N(0, ξ_r^t),  X̂^{t+1} = G_x^t(R^t, τ̄_r^t, λ̄_x^t),    (12)

for some deterministic constants α_r^t, ξ_r^t, τ̄_r^t and λ̄_x^t that will be defined momentarily. Similarly,

    (Z, P^t) ~ N(0, K_p^t),  Y ~ P_{Y|Z}(·|Z),  Ẑ^t = G_z^t(P^t, Y, τ̄_p^t, λ̄_z^t),    (13)

where W is the random variable in (9) and K_p^t, τ̄_p^t and λ̄_z^t are also deterministic constants. The deterministic constants above can be computed iteratively with the state evolution (SE) equations shown in Algorithm 2.

Theorem 1 Consider the random vectors θ_x^t and θ_z^t generated by the outputs of GAMP under Assumption 1. Let θ̄_x^t and θ̄_z^t be the random vectors in (11) with the parameters determined by the SE equations in Algorithm 2. Then, for any fixed t, almost surely, the components of θ_x^t and θ_z^t converge empirically with bounded moments of order k = 2 as

    lim_{n→∞} θ_x^t =_{PL(k)} θ̄_x^t,  lim_{n→∞} θ_z^t =_{PL(k)} θ̄_z^t,    (17)

where θ̄_x^t and θ̄_z^t are given in (11).
In addition, for any t, the limits

    lim_n λ̂_x^t = λ̄_x^t,  lim_n λ̂_z^t = λ̄_z^t,  lim_n τ_r^t = τ̄_r^t,  lim_n τ_p^t = τ̄_p^t,    (18)

also hold almost surely.

Similar to several other analyses of AMP algorithms such as [14–18], the theorem provides a scalar equivalent model for the componentwise behavior of the adaptive GAMP method. That is, asymptotically the components of the sets θ_x^t and θ_z^t in (10) are distributed identically to simple scalar random variables. The parameters in these random variables can be computed via the SE equations (14), (15) and (16), which can be evaluated with one- or two-dimensional integrals.

Algorithm 2 Adaptive GAMP State Evolution
Given the distributions in Assumption 1, compute the sequence of parameters as follows:

• Initialization: Set t = 0 with

    τ̄_x^0 = τ_x^0,  K_x^0 = cov(X, X̂^0),    (14)

where the expectation is over the random variables (X, X̂^0) in Assumption 1(b), and τ_x^0 is the initial value in the GAMP algorithm.

• Output node update: Compute the variables associated with θ̄_z^t:

    τ̄_p^t = β τ̄_x^t,  K_p^t = β K_x^t,  λ̄_z^t = H_z^t(P^t, Y, τ̄_p^t),    (15a)
    τ̄_r^t = −( E[ ∂/∂p G_s^t(P^t, Y, τ̄_p^t, λ̄_z^t) ] )^{−1},  ξ_r^t = (τ̄_r^t)^2 E[ ( G_s^t(P^t, Y, τ̄_p^t, λ̄_z^t) )^2 ],    (15b)
    α_r^t = τ̄_r^t E[ ∂/∂z G_s^t(P^t, h(z, W), τ̄_p^t, λ̄_z^t) |_{z=Z} ],    (15c)

where the expectations are over the random variables (P^t, Y, W).

• Input node update: Compute the variables associated with θ̄_x^t:

    λ̄_x^t = H_x^t(R^t, τ̄_r^t),  τ̄_x^{t+1} = τ̄_r^t E[ ∂/∂r G_x^t(R^t, τ̄_r^t, λ̄_x^t) ],    (16a)
    K_x^{t+1} = cov(X, X̂^{t+1}),    (16b)

where the expectation is over the random variables (X, X̂^{t+1}).

From this scalar equivalent model, one can compute a large class of componentwise performance metrics such as mean-squared error (MSE) or detection error rates. Thus, the SE analysis shows that, for essentially arbitrary estimation and adaptation functions and distributions on the true input and disturbance, we can exactly evaluate the asymptotic behavior of the adaptive GAMP algorithm. In addition, when the parameter values λ_x and λ_z are fixed, the SE equations in Algorithm 2 reduce to the SE equations for the standard (non-adaptive) GAMP algorithm described in [14].

3.2 Asymptotic Consistency with ML Adaptation

The general result, Theorem 1, can be applied to the adaptive GAMP algorithm with arbitrary estimation and adaptation functions. In particular, the result can be used to rigorously justify the SE analysis of the EM-GAMP method presented in [11, 12]. Here, we use the result to prove the asymptotic parameter consistency of adaptive GAMP with ML adaptation. The key point is to realize that the distributions (12) and (13) exactly match the distributions (4) assumed by the ML adaptation functions (5). Thus, the ML adaptation should work provided that the maximizations in (5) yield the correct parameter estimates. This condition is essentially an identifiability requirement that we make precise with the following definitions.

Definition 1 Consider a family of distributions {P_X(x|λ_x), λ_x ∈ Λ_x}, a set S_x of parameters (α_r, ξ_r) of a Gaussian channel, and a function φ_x(r, λ_x, α_r, ξ_r).
We say that P_X(x|λ_x) is identifiable with Gaussian outputs with parameter set S_x and function φ_x if:

(a) The sets S_x and Λ_x are compact.

(b) For any "true" parameters λ*_x ∈ Λ_x and (α*_r, ξ*_r) ∈ S_x, the maximization

    λ̂_x = arg max_{λ_x ∈ Λ_x} max_{(α_r, ξ_r) ∈ S_x} E[ φ_x(α*_r X + V, λ_x, α_r, ξ_r) | λ*_x, α*_r, ξ*_r ],    (19)

is well-defined, unique, and returns the true value, λ̂_x = λ*_x. The expectation in (19) is with respect to X ~ P_X(·|λ*_x) and V ~ N(0, ξ*_r).

(c) Suitable continuity assumptions – see [19] for details.

Definition 2 Consider a family of conditional distributions {P_{Y|Z}(y|z, λ_z), λ_z ∈ Λ_z} generated by the mapping Y = h(Z, W, λ_z), where W ~ P_W is some random variable and h(z, w, λ_z) is a scalar function. Let S_z be a set of covariance matrices K_p and let φ_z(y, p, λ_z, K_p) be some function.
We say that the conditional distribution family P_{Y|Z}(·|·, λ_z) is identifiable with Gaussian inputs with covariance set S_z and function φ_z if:

(a) The parameter sets S_z and Λ_z are compact.

(b) For any "true" parameter λ*_z ∈ Λ_z and true covariance K*_p, the maximization

    λ̂_z = arg max_{λ_z ∈ Λ_z} max_{K_p ∈ S_z} E[ φ_z(Y, P, λ_z, K_p) | λ*_z, K*_p ],    (20)

is well-defined, unique, and returns the true value, λ̂_z = λ*_z. The expectation in (20) is with respect to Y|Z ~ P_{Y|Z}(y|z, λ*_z) and (Z, P) ~ N(0, K*_p).

(c) Suitable continuity assumptions – see [19] for details.

Definitions 1 and 2 essentially require that the parameters λ_x and λ_z can be identified through a maximization. The functions φ_x and φ_z can be the log-likelihood functions (6a) and (6b), although we permit other functions as well. See [19] for further discussion of the likelihood functions as well as the choice of the parameter sets S_x and S_z.

Theorem 2 Let P_X(·|λ_x) and P_{Y|Z}(·|·, λ_z) be families of input and output distributions that are identifiable in the sense of Definitions 1 and 2. Consider the outputs of the adaptive GAMP algorithm using the ML adaptation functions (5) with the functions φ_x and φ_z and parameter sets in Definitions 1 and 2. In addition, suppose Assumption 1(a) to (c) hold, where the distribution of X is given by P_X(·|λ*_x) for some "true" parameter λ*_x ∈ Λ_x, and the conditional distribution of Y given Z is given by P_{Y|Z}(y|z, λ*_z) for some "true" parameter λ*_z ∈ Λ_z.
Then, under suitable continuity conditions (see [19] for details), for any fixed t:

(a) The components of θ_x^t and θ_z^t in (10) converge empirically with bounded moments of order k = 2 as in (17), and the limits (18) hold almost surely.

(b) If (α_r^t, ξ_r^t) ∈ S_x(τ̄_r^t) for some t, then lim_{n→∞} λ̂_x^t = λ̄_x^t = λ*_x almost surely.

(c) If K_p^t ∈ S_z(τ̄_p^t) for some t, then lim_{n→∞} λ̂_z^t = λ̄_z^t = λ*_z almost surely.

The theorem shows, remarkably, that for a very large class of parameterized distributions, the adaptive GAMP algorithm with ML adaptation is able to asymptotically estimate the correct parameters. Also, once the consistency limits in (b) and (c) hold, the SE equations in Algorithm 2 reduce to the SE equations for the non-adaptive GAMP method running with the true parameters. Thus, we conclude that there is asymptotically no performance loss between the adaptive GAMP algorithm and a corresponding oracle GAMP algorithm that knows the correct parameters, in the sense that the empirical distributions of the algorithm outputs are described by the same SE equations.

4 Numerical Example: Estimation of a Gauss-Bernoulli input

Recent results suggest that there is considerable value in the learning of priors P_X in the context of compressed sensing [25], which considers the estimation of sparse vectors x from underdetermined measurements (m < n). It is known that estimators such as LASSO offer certain optimal min-max performance over a large class of sparse distributions [26]. However, for many particular distributions, there is a potentially large performance gap between LASSO and the MMSE estimator with the correct prior.
This gap was the main motivation for [9, 10], which showed large gains of the EM-GAMP method due to its ability to learn the prior. Here, we present a simple simulation to illustrate the performance gain of adaptive GAMP and its asymptotic consistency. Specifically, Fig. 2 compares the performance of adaptive GAMP for the estimation of a sparse Gauss-Bernoulli signal x ∈ R^n from m noisy measurements

    y = Ax + w,

Figure 2: Reconstruction of a Gauss-Bernoulli signal from noisy measurements. The average reconstruction MSE is plotted against (a) the measurement ratio m/n and (b) the AWGN variance σ². The plots illustrate that adaptive GAMP yields considerable improvement over the ℓ1-based LASSO estimator. Moreover, it exactly matches the performance of oracle GAMP that knows the prior parameters.

where the additive noise w is random with i.i.d. entries w_i ~ N(0, σ²). The signal of length n = 400 has 20% nonzero components drawn from the Gaussian distribution of variance 5. Adaptive GAMP uses EM iterations, which are used to approximate ML parameter estimation, to jointly recover the unknown signal x and the true parameters λ_x = (ρ = 0.2, σ_x² = 5). The performance of adaptive GAMP is compared to that of LASSO with the MSE-optimal regularization parameter, and to oracle GAMP that knows the parameters of the prior exactly. For generating the graphs, we performed 1000 random trials by forming the measurement matrix A from i.i.d. zero-mean Gaussian random variables of variance 1/m. In Figure 2(a), we keep the variance of the noise fixed to σ² = 0.1 and plot the average MSE of the reconstruction against the measurement ratio m/n. In Figure 2(b), we keep the measurement ratio fixed to m/n = 0.75 and plot the average MSE of the reconstruction against the noise variance σ². For completeness, we also provide the asymptotic MSE values computed via the SE recursion.
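One trial of the experiment just described can be set up as follows, using the settings stated above (n = 400, 20% nonzero components of variance 5, noise variance σ², A with i.i.d. N(0, 1/m) entries); the dB helper matches the units of Figure 2. The reconstruction step itself (adaptive GAMP, LASSO, or oracle GAMP) is deliberately omitted, so only the data generation is shown.

```python
import numpy as np

# One random trial of the Gauss-Bernoulli experiment (settings from the text;
# the reconstruction algorithms themselves are not included in this sketch).
rng = np.random.default_rng(3)
n, ratio, sigma2 = 400, 0.75, 0.1
m = int(ratio * n)

x = rng.binomial(1, 0.2, n) * rng.normal(0.0, np.sqrt(5.0), n)
A = rng.normal(0.0, np.sqrt(1.0 / m), (m, n))
y = A @ x + rng.normal(0.0, np.sqrt(sigma2), m)

def mse_db(xhat, x):
    return 10.0 * np.log10(np.mean((xhat - x) ** 2))

baseline = mse_db(np.zeros(n), x)  # MSE of the all-zero estimate, approx. 0 dB
```

Averaging mse_db of an estimator's output over many such trials reproduces the quantity plotted on the vertical axes of Figure 2.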
The results illustrate that GAMP significantly outperforms LASSO over the whole range of m/n and σ². Moreover, the results corroborate the consistency of adaptive GAMP, which achieves nearly identical reconstruction quality to oracle GAMP. The performance results here and in [19] indicate that adaptive GAMP can be an effective method for estimation when the parameters of the problem are difficult to characterize and must be estimated from data.

5 Conclusions and Future Work

We have presented an adaptive GAMP method for the estimation of i.i.d. vectors x observed through a known linear transform followed by an arbitrary, componentwise random transform. The procedure, which is a generalization of the EM-GAMP methodology of [9, 10], estimates both the vector x and the parameters in the source and componentwise output transform. In the case of large i.i.d. Gaussian transforms with ML parameter estimation, it is shown that the adaptive GAMP method is provably asymptotically consistent in that the parameter estimates converge to the true values. This convergence result holds over a large class of models with essentially arbitrarily complex parameterizations. Moreover, the algorithm is computationally efficient since it reduces the vector-valued estimation problem to a sequence of scalar estimation problems in Gaussian noise. We believe that this method is applicable to a large class of linear-nonlinear models with provable guarantees and that it can have applications in a wide range of problems. We have mentioned the use of the method for learning sparse priors in compressed sensing. Future work will include possible extensions to non-Gaussian matrices.

References

[1] M. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Machine Learning Research, vol. 1, pp. 211–244, Sep. 2001.

[2] M.
West, \u201cBayesian factor regressionm models in the \u201clarge p, small n\u201d paradigm,\u201d Bayesian Statistics,\n\nvol. 7, 2003.\n\n8\n\n(a)(b)Noise variance ( )Measurement ratio ( )MSE (dB)MSE (dB)0.511.52(cid:239)14(cid:239)13(cid:239)12(cid:239)11(cid:239)10(cid:239)9(cid:239)8(cid:239)7Measurement ratio (m/n)MSE (dB) State EvolutionLASSOOracle GAMPAdaptive GAMP10(cid:239)310(cid:239)210(cid:239)1(cid:239)35(cid:239)30(cid:239)25(cid:239)20(cid:239)15(cid:239)10Noise Variance ((cid:109)2)MSE (dB) \f[3] D. Wipf and B. Rao, \u201cSparse Bayesian learning for basis selection,\u201d IEEE Trans. Signal Process., vol. 52,\n\nno. 8, pp. 2153\u20132164, Aug. 2004.\n\n[4] S. Ji, Y. Xue, and L. Carin, \u201cBayesian compressive sensing,\u201d IEEE Trans. Signal Process., vol. 56, pp.\n\n2346\u20132356, Jun. 2008.\n\n[5] V. Cevher, \u201cLearning with compressible priors,\u201d in Proc. NIPS, Vancouver, BC, Dec. 2009.\n[6] S. Billings and S. Fakhouri, \u201cIdenti\ufb01cation of systems containing linear dynamic and static nonlinear\n\nelements,\u201d Automatica, vol. 18, no. 1, pp. 15\u201326, 1982.\n\n[7] I. W. Hunter and M. J. Korenberg, \u201cThe identi\ufb01cation of nonlinear biological systems: Wiener and Ham-\n\nmerstein cascade models,\u201d Biological Cybernetics, vol. 55, no. 2\u20133, pp. 135\u2013144, 1986.\n\n[8] O. Schwartz, J. W. Pillow, N. C. Rust, and E. P. Simoncelli, \u201cSpike-triggered neural characterization,\u201d J.\n\nVision, vol. 6, no. 4, pp. 484\u2013507, Jul. 2006.\n\n[9] J. P. Vila and P. Schniter, \u201cExpectation-maximization Bernoulli-Gaussian approximate message passing,\u201d\nin Conf. Rec. 45th Asilomar Conf. Signals, Syst. & Comput., Paci\ufb01c Grove, CA, Nov. 2011, pp. 799\u2013803.\n[10] \u2014\u2014, \u201cExpectation-maximization Gaussian-mixture approximate message passing,\u201d in Proc. Conf. on\n\nInform. Sci. & Sys., Princeton, NJ, Mar. 2012.\n\n[11] F. Krzakala, M. M\u00b4ezard, F. Sausset, Y. Sun, and L. 
Zdeborov\u00b4a, \u201cStatistical physics-based reconstruction\n\nin compressed sensing,\u201d arXiv:1109.4424, Sep. 2011.\n\n[12] \u2014\u2014, \u201cProbabilistic reconstruction in compressed sensing: Algorithms, phase diagrams, and threshold\n\nachieving matrices,\u201d arXiv:1206.3953, Jun. 2012.\n\n[13] S. Rangan, A. K. Fletcher, V. K. Goyal, and P. Schniter, \u201cHybrid generalized approximation message\npassing with applications to structured sparsity,\u201d in Proc. IEEE Int. Symp. Inform. Theory, Cambridge,\nMA, Jul. 2012, pp. 1241\u20131245.\n\n[14] S. Rangan, \u201cGeneralized approximate message passing for estimation with random linear mixing,\u201d in\n\nProc. IEEE Int. Symp. Inform. Theory, Saint Petersburg, Russia, Jul.\u2013Aug. 2011, pp. 2174\u20132178.\n\n[15] D. Guo and C.-C. Wang, \u201cAsymptotic mean-square optimality of belief propagation for sparse linear\n\nsystems,\u201d in Proc. IEEE Inform. Theory Workshop, Chengdu, China, Oct. 2006, pp. 194\u2013198.\n\n[16] \u2014\u2014, \u201cRandom sparse linear systems observed via arbitrary channels: A decoupling principle,\u201d in Proc.\n\nIEEE Int. Symp. Inform. Theory, Nice, France, Jun. 2007, pp. 946\u2013950.\n\n[17] S. Rangan, \u201cEstimation with random linear mixing, belief propagation and compressed sensing,\u201d in Proc.\n\nConf. on Inform. Sci. & Sys., Princeton, NJ, Mar. 2010, pp. 1\u20136.\n\n[18] M. Bayati and A. Montanari, \u201cThe dynamics of message passing on dense graphs, with applications to\n\ncompressed sensing,\u201d IEEE Trans. Inform. Theory, vol. 57, no. 2, pp. 764\u2013785, Feb. 2011.\n\n[19] U. S. Kamilov, S. Rangan, A. K. Fletcher, and M. Unser, \u201cApproximate message passing with consistent\n\nparameter estimation and applications to sparse learning,\u201d arXiv:1207.3859 [cs.IT], Jul. 2012.\n\n[20] J. Boutros and G. Caire, \u201cIterative multiuser joint decoding: Uni\ufb01ed framework and asymptotic analysis,\u201d\n\nIEEE Trans. Inform. Theory, vol. 
48, no. 7, pp. 1772–1793, Jul. 2002.

[21] T. Tanaka and M. Okada, “Approximate belief propagation, density evolution, and neurodynamics for CDMA multiuser detection,” IEEE Trans. Inform. Theory, vol. 51, no. 2, pp. 700–706, Feb. 2005.

[22] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.

[23] T. P. Minka, “A family of algorithms for approximate Bayesian inference,” Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, 2001.

[24] M. Seeger, “Bayesian inference and optimal design for the sparse linear model,” J. Machine Learning Research, vol. 9, pp. 759–813, Sep. 2008.

[25] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?” IEEE Trans. Inform. Theory, vol. 52, no. 12, pp. 5406–5425, Dec. 2006.

[26] D. Donoho, I. Johnstone, A. Maleki, and A. Montanari, “Compressed sensing over ℓp-balls: Minimax mean square error,” in Proc. ISIT, St. Petersburg, Russia, Jun. 2011.