{"title": "Rigorous Dynamics and Consistent Estimation in Arbitrarily Conditioned Linear Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 2545, "page_last": 2554, "abstract": "The problem of estimating a random vector x from noisy linear measurements y=Ax+w with unknown parameters on the distributions of x and w, which must also be learned, arises in a wide range of statistical learning and linear inverse problems. We show that a computationally simple iterative message-passing algorithm can provably obtain asymptotically consistent estimates in a certain high-dimensional large-system limit (LSL) under very general parameterizations. Previous message passing techniques have required i.i.d. sub-Gaussian A matrices and often fail when the matrix is ill-conditioned. The proposed algorithm, called adaptive vector approximate message passing (Adaptive VAMP) with auto-tuning, applies to all right-rotationally random A. Importantly, this class includes matrices with arbitrarily bad conditioning. We show that the parameter estimates and mean squared error (MSE) of x in each iteration converge to deterministic limits that can be precisely predicted by a simple set of state evolution (SE) equations. In addition, a simple testable condition is provided in which the MSE matches the Bayes-optimal value predicted by the replica method. The paper thus provides a computationally simple method with provable guarantees of optimality and consistency over a large class of linear inverse problems.", "full_text": "Rigorous Dynamics and Consistent Estimation in\n\nArbitrarily Conditioned Linear Systems\n\nAlyson K. Fletcher\n\nDept. Statistics\nUC Los Angeles\n\nakfletcher@ucla.edu\n\nMojtaba Sahraee-Ardakan\n\nSundeep Rangan\n\nmsahraee@ucla.edu\n\nsrangan@nyu.edu\n\nDept. EE,\n\nUC Los Angeles\n\nDept. ECE,\n\nNYU\n\nPhilip Schniter\n\nDept. 
ECE,\n\nThe Ohio State Univ.\n\nschniter@ece.osu.edu\n\nAbstract\n\nWe consider the problem of estimating a random vector x from noisy linear mea-\nsurements y = Ax + w in the setting where parameters \u03b8 on the distribution of\nx and w must be learned in addition to the vector x. This problem arises in a\nwide range of statistical learning and linear inverse problems. Our main contribu-\ntion shows that a computationally simple iterative message passing algorithm can\nprovably obtain asymptotically consistent estimates in a certain high-dimensional\nlarge system limit (LSL) under very general parametrizations. Importantly, this\nLSL applies to all right-rotationally random A \u2013 a much larger class of matrices\nthan i.i.d. sub-Gaussian matrices to which many past message passing approaches\nare restricted. In addition, a simple testable condition is provided in which the\nmean square error (MSE) on the vector x matches the Bayes optimal MSE pre-\ndicted by the replica method. The proposed algorithm uses a combination of\nExpectation-Maximization (EM) with a recently-developed Vector Approximate\nMessage Passing (VAMP) technique. We develop an analysis framework that\nshows that the parameter estimates in each iteration of the algorithm converge to\ndeterministic limits that can be precisely predicted by a simple set of state evolution\n(SE) equations. 
The SE equations, which extend those of VAMP without parameter adaptation, depend only on the initial parameter estimates and the statistical properties of the problem, and can be used to predict consistency and precisely characterize other performance measures of the method.

1 Introduction

Consider the problem of estimating a random vector x0 from linear measurements y of the form

y = Ax0 + w,   w ~ N(0, θ2^{-1} I),   x0 ~ p(x|θ1),   (1)

where A ∈ R^{M×N} is a known matrix, p(x|θ1) is a density on x0 with parameters θ1, w is additive white Gaussian noise (AWGN) independent of x0, and θ2 > 0 is the noise precision (inverse variance). The goal is to estimate x0 while simultaneously learning the unknown parameters θ := (θ1, θ2) from the data y and A. This problem arises in Bayesian forms of linear inverse problems in signal processing, as well as in linear regression in statistics.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Exact estimation of the parameters θ via maximum likelihood or other methods is generally intractable. One promising class of approximate methods combines approximate message passing (AMP) [1] with expectation-maximization (EM). AMP and its generalizations [2] are a powerful, relatively recent, class of algorithms based on expectation propagation-type techniques. The AMP methodology has the benefit of being computationally fast and has been successfully applied to a wide range of problems. Most importantly, for large, i.i.d., sub-Gaussian random matrices A, the performance of AMP methods can be exactly predicted by a scalar state evolution (SE) [3, 4] that provides testable conditions for optimality, even for non-convex priors. 
When the parameters θ are unknown, AMP can be easily combined with EM for joint learning of the parameters θ and the vector x [5-7].

A recent work [8] has combined EM with the so-called Vector AMP (VAMP) method of [9]. Similar to AMP, VAMP is based on expectation propagation (EP) approximations of belief propagation [10, 11] and can also be considered a special case of expectation consistent (EC) approximate inference [12-14]. VAMP's key attraction is that it applies to a larger class of matrices A than standard AMP methods. Aside from Gaussian i.i.d. A, standard AMP techniques often diverge and require a variety of modifications for stability [15-18]. In contrast, VAMP has provable SE analyses and convergence guarantees that apply to all right-rotationally invariant matrices A [9, 19], a significantly larger class of matrices than i.i.d. Gaussians. Under further conditions, the mean-squared error (MSE) of VAMP matches the replica predictions for optimality [20-23]. For the case when the distributions on x and w are unknown, the work [8] proposed to combine EM and VAMP using the approximate inference framework of [24]. The combination of AMP with EM methods has been particularly successful in neural modeling problems [25, 26]. While [8] provides numerical simulations demonstrating excellent performance of this EM-VAMP method on a range of synthetic data, it offered no provable convergence guarantees.

Contributions of this work

- Rigorous state evolution analysis: We provide a rigorous analysis of a generalization of EM-VAMP that we call Adaptive VAMP. Similar to the analysis of VAMP, we consider a certain large system limit (LSL) where the matrix A is random and right-rotationally invariant. Importantly, this class of matrices is much more general than the i.i.d. Gaussians used in the original LSL analysis of Bayati and Montanari [3]. It is shown (Theorem 1) that in the LSL, the parameter estimates at each iteration converge to deterministic limits that can be computed from a set of SE equations that extend those of VAMP. The analysis also exactly characterizes the asymptotic joint distribution of the estimates x̂ and the true vector x0. The SE equations depend only on the initial parameter estimate, the adaptation function, and statistics of the matrix A, the vector x0 and the noise w. The SE analysis thus provides a rigorous and exact characterization of the dynamics of EM-VAMP; in particular, it can determine under which initial conditions and problem statistics EM-VAMP will yield asymptotically consistent parameter estimates.

- Asymptotic consistency: It is also shown (Theorem 2) that, under an additional identifiability condition and a simple auto-tuning procedure, Adaptive VAMP can yield provably consistent parameter estimates in the LSL. The technique uses an ML estimation approach from [7]. Remarkably, the result holds under very general problem formulations.

- Bayes optimality: In the case when the parameter estimates converge to the true value, the behavior of adaptive VAMP matches that of VAMP. 
In this case, it is shown in [9] that, when the SE equations have a unique fixed point, the MSE of VAMP matches the MSE of the Bayes-optimal estimator predicted by the replica method [21-23].

In this way, we have developed a computationally efficient method for a large class of linear inverse problems with the properties that, in a certain high-dimensional limit: (1) the performance of the algorithm can be exactly characterized; (2) the parameter estimates θ̂ are asymptotically consistent; and (3) the algorithm has testable conditions under which the signal estimates x̂ match replica predictions for Bayes optimality.

2 VAMP with Adaptation

Assume the prior on x can be written as

p(x|θ1) = (1/Z1(θ1)) exp[-f1(x|θ1)],   f1(x|θ1) = Σ_{n=1}^N f1(xn|θ1),   (2)

Algorithm 1 Adaptive VAMP
Require: Matrix A ∈ R^{M×N}, measurement vector y, denoiser function g1(·), statistic function φ1(·), adaptation function T1(·) and number of iterations Nit.
 1: Select initial r10, γ10 ≥ 0, θ̂10, θ̂20.
 2: for k = 0, 1, . . . , Nit - 1 do
 3:    // Input denoising
 4:    x̂1k = g1(r1k, γ1k, θ̂1k),   η1k^{-1} = γ1k / ⟨g1'(r1k, γ1k, θ̂1k)⟩
 5:    γ2k = η1k - γ1k
 6:    r2k = (η1k x̂1k - γ1k r1k)/γ2k
 7:    // Input parameter update
 8:    μ1k = ⟨φ1(r1k, γ1k, θ̂1k)⟩
 9:    θ̂1,k+1 = T1(μ1k)
10:    // Output estimation
11:    Qk = θ̂2k AᵀA + γ2k I
12:    x̂2k = Qk^{-1}(θ̂2k Aᵀy + γ2k r2k)
13:    η2k^{-1} = (1/N) tr(Qk^{-1})
14:    γ1,k+1 = η2k - γ2k
15:    r1,k+1 = (η2k x̂2k - γ2k r2k)/γ1,k+1
16:    // Output parameter update
17:    // (EM update of the noise precision)
18:    θ̂2,k+1^{-1} = (1/N){ ‖y - A x̂2k‖² + tr(A Qk^{-1} Aᵀ) }
19: end for

where f1(·) is a separable penalty function, θ1 is a parameter vector and Z1(θ1) is a normalization constant. With some abuse of notation, we have used f1(·) both for the function on the vector x and for its components xn. Since f1(x|θ1) is separable, x has i.i.d. components conditioned on θ1. The likelihood function under the Gaussian model (1) can be written as

p(y|x, θ2) := (1/Z2(θ2)) exp[-f2(x, y|θ2)],   f2(x, y|θ2) := (θ2/2) ‖y - Ax‖²,   (3)

where Z2(θ2) = (2π/θ2)^{N/2}. The joint density of x, y given parameters θ = (θ1, θ2) is then

p(x, y|θ) = p(x|θ1) p(y|x, θ2).   (4)

The problem is to estimate the parameters θ = (θ1, θ2) along with the vector x0.

The steps of the proposed adaptive VAMP algorithm are shown in Algorithm 1, which is a generalization of the EM-VAMP method in [8]. In each iteration, the algorithm produces, for i = 1, 2, estimates θ̂ik of the parameter θi, along with estimates x̂ik of the vector x0. The algorithm is tuned by selecting three key functions: (i) a denoiser function g1(·); (ii) an adaptation statistic φ1(·); and (iii) a parameter selection function T1(·). The denoiser is used to produce the estimates x̂1k, while the adaptation statistic and parameter selection functions produce the estimates θ̂1k.

Denoiser function  The denoiser function g1(·) is discussed in detail in [9] and is generally based on the prior p(x|θ1). In the original EM-VAMP algorithm [8], g1(·) is selected as the so-called minimum mean-squared error (MMSE) denoiser. Specifically, in each iteration, the variables ri, γi and θ̂i are used to construct belief estimates,

bi(x|ri, γi, θ̂i) ∝ exp[ -fi(x, y|θ̂i) - (γi/2) ‖x - ri‖² ],   (5)

which represent estimates of the posterior density p(x|y, θ). To keep the notation symmetric, we have written f1(x, y|θ̂1) for f1(x|θ̂1) even though the first penalty function does not depend on y. The EM-VAMP method then selects g1(·) to be the mean of the belief estimate,

g1(r1, γ1, θ1) := E[x | r1, γ1, θ1].   (6)

For line 4 of Algorithm 1, we define [g1'(r1k, γ1k, θ1)]n := ∂[g1(r1k, γ1k, θ1)]n/∂r1n, and we use ⟨·⟩ for the empirical mean of a vector, i.e., ⟨u⟩ = (1/N) Σ_{n=1}^N un. Hence, η1k in line 4 is a scaled inverse divergence. 
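To make the iteration concrete, here is a minimal numerical sketch of Algorithm 1, specialized to a zero-mean Gaussian prior x ~ N(0, σx² I), for which the MMSE denoiser (6) and its divergence have closed forms, and with only the noise precision θ2 auto-tuned as in line 18. The function name and all implementation choices below are ours, for illustration; this is not the authors' reference code.

```python
import numpy as np

def adaptive_vamp_gauss(y, A, sigx2=1.0, theta2_init=1.0, n_it=20, eps=1e-8):
    """Sketch of Algorithm 1 for a Gaussian prior x ~ N(0, sigx2 * I).

    Variable names follow the paper: (r1, gam1) is the denoiser input,
    (r2, gam2) the LMMSE input, and th2 the noise-precision estimate,
    updated each pass as in line 18.
    """
    M, N = A.shape
    AtA, Aty = A.T @ A, A.T @ y
    r1, gam1 = np.zeros(N), 1.0 / sigx2
    th2 = theta2_init
    for _ in range(n_it):
        # Input denoising (lines 3-6): the MMSE denoiser is linear here,
        # g1(r1) = a * r1 with constant divergence <g1'> = a.
        a = sigx2 * gam1 / (sigx2 * gam1 + 1.0)
        x1 = a * r1
        eta1 = gam1 / a
        gam2 = max(eta1 - gam1, eps)
        r2 = (eta1 * x1 - gam1 * r1) / gam2
        # Output estimation (lines 10-15): LMMSE step.
        Q = th2 * AtA + gam2 * np.eye(N)
        Qinv = np.linalg.inv(Q)
        x2 = Qinv @ (th2 * Aty + gam2 * r2)
        eta2 = N / np.trace(Qinv)
        gam1 = max(eta2 - gam2, eps)
        r1 = (eta2 * x2 - gam2 * r2) / gam1
        # Output parameter update (line 18): EM-style noise precision.
        resid = y - A @ x2
        th2 = N / (resid @ resid + np.trace(A @ Qinv @ A.T))
    return x2, th2

# Usage on a small synthetic square instance (true noise precision 100).
rng = np.random.default_rng(0)
N = 120
A = rng.standard_normal((N, N)) / np.sqrt(N)
x0 = rng.standard_normal(N)
y = A @ x0 + rng.standard_normal(N) / 10.0
xhat, th2_est = adaptive_vamp_gauss(y, A, n_it=30)
```

In our runs, even with the badly mis-specified initialization θ̂20 = 1, the line-18 update moves the noise-precision estimate toward its true value while the signal estimate approaches the LMMSE solution.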
It is shown in [9] that, for the MMSE denoiser (6), η1k is the inverse average posterior variance.

Estimation of θ1 with finite statistics  For the EM-VAMP algorithm [8], the parameter update θ̂1,k+1 is performed via the maximization

θ̂1,k+1 = arg max_{θ1} (1/N) Σ_{n=1}^N E[ ln p(xn|θ1) | r1k, γ1k, θ̂1k ],   (7)

where the expectation is with respect to the belief estimate bi(·) in (5). It is shown in [8] that using (7) is equivalent to an approximation of the M-step in the standard EM method. In the adaptive VAMP method in Algorithm 1, the M-step maximization (7) is replaced by lines 8-9. Note that line 8 again uses ⟨·⟩ to denote the empirical average,

μ1k = ⟨φ1(r1k, γ1k, θ̂1k)⟩ := (1/N) Σ_{n=1}^N φ1(r1k,n, γ1k, θ̂1k) ∈ R^d,   (8)

so μ1k is the empirical average of a d-dimensional statistic φ1(·) over the components of r1k. The parameter estimate update θ̂1,k+1 is then computed from this statistic as T1(μ1k). We show in the full paper [27] that there are two important cases where the EM update (7) can be computed from a finite-dimensional statistic in this form: (i) the prior p(x|θ1) is given by an exponential family, f1(x|θ1) = θ1ᵀφ(x) for some sufficient statistic φ(x); and (ii) there is a finite number of values for the parameter θ1. For other cases, we can approximate more general parametrizations via a discretization of the parameter values θ1. The updates in line 9 can also incorporate other types of updates, as we will see below. 
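As a concrete example, the pair (φ1, T1) can be written explicitly for a zero-mean Bernoulli-Gaussian (spike-and-slab) prior with θ1 = (β, σx²) (sparsity level and slab variance), where the EM update (7) reduces to a two-dimensional statistic. The sketch below is our own construction, not code from [8] or [27]; it runs the resulting finite-statistic EM update on synthetic scalar observations r = x + N(0, 1/γ).

```python
import numpy as np

def phi1(r, gamma, theta):
    """Per-component statistic for a zero-mean Bernoulli-Gaussian prior.

    theta = (beta, sigx2). Under the belief estimate, r = x + N(0, 1/gamma).
    Returns the posterior slab responsibility p and the posterior second
    moment p * (m**2 + v) of the slab component.
    """
    beta, sigx2 = theta
    tau = 1.0 / gamma
    # Common 1/sqrt(2*pi) factors cancel in the responsibility p.
    like1 = np.exp(-0.5 * r**2 / (sigx2 + tau)) / np.sqrt(sigx2 + tau)
    like0 = np.exp(-0.5 * r**2 / tau) / np.sqrt(tau)
    p = beta * like1 / (beta * like1 + (1.0 - beta) * like0)
    m = sigx2 / (sigx2 + tau) * r        # slab conditional mean
    v = sigx2 * tau / (sigx2 + tau)      # slab conditional variance
    return np.stack([p, p * (m**2 + v)])

def T1(mu):
    """M-step map: averages (E[p], E[p * x^2]) -> new (beta, sigx2)."""
    return (mu[0], mu[1] / mu[0])

# EM with finite statistics on synthetic data, true theta1 = (0.1, 1.0).
rng = np.random.default_rng(1)
n, gamma = 50000, 5.0
x = rng.standard_normal(n) * (rng.random(n) < 0.1)
r = x + rng.standard_normal(n) / np.sqrt(gamma)
theta = (0.3, 2.0)                       # deliberately wrong initialization
for _ in range(50):
    mu = phi1(r, gamma, theta).mean(axis=1)   # mu = <phi1(r, gamma, theta)>
    theta = T1(mu)
```

Iterating the two lines in the loop is exactly the finite-statistic form of the M-step: only the d = 2 empirical averages are needed, not the individual components.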
But we stress that it is preferable to compute the estimate for θ1 directly from the maximization (7); the use of a finite-dimensional statistic is for the sake of analysis.

Estimation of θ2 with finite statistics  It will be useful to also write the adaptation of θ2 in line 18 of Algorithm 1 in a similar form. First, take a singular value decomposition (SVD) of A of the form

A = USVᵀ,   S = Diag(s),   (9)

and define the transformed error and transformed noise,

qk := Vᵀ(r2k - x0),   ξ := Uᵀw.   (10)

Then, it is shown in the full paper [27] that θ̂2,k+1 in line 18 can be written as

θ̂2,k+1 = T2(μ2k) := 1/μ2k,   μ2k = ⟨φ2(qk, ξ, s, γ2k, θ̂2k)⟩,   (11)

where

φ2(q, ξ, s, γ2, θ̂2) := γ2²(sq + ξ)²/(s²θ̂2 + γ2)² + s²/(s²θ̂2 + γ2).   (12)

Of course, we cannot directly compute qk in (10), since we do not know the true x0. Nevertheless, this form will be useful for the analysis.

3 State Evolution in the Large System Limit

3.1 Large System Limit

Similar to the analysis of VAMP in [9], we analyze Algorithm 1 in a certain large system limit (LSL). The LSL framework was developed by Bayati and Montanari in [3], and we review some of the key definitions in the full paper [27]. As in the analysis of VAMP, the LSL considers a sequence of problems indexed by the vector dimension N. For each N, we assume that there is a "true" vector x0 ∈ R^N that is observed through measurements of the form

y = Ax0 + w ∈ R^N,   w ~ N(0, θ2^{-1} I_N),   (13)

where A ∈ R^{N×N} is a known transform, w is Gaussian noise and θ2 represents a "true" noise precision. 
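The rewriting of line 18 into the transformed form (11)-(12) is a direct SVD calculation and is easy to verify numerically. In the sketch below (our code), we use the sign convention q = Vᵀ(x0 - r2k); the sign is immaterial in distribution and makes the identity hold exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 60
# Right-rotationally invariant square A = U S V^T with an arbitrary spectrum.
U, _ = np.linalg.qr(rng.standard_normal((N, N)))
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
s = rng.uniform(0.1, 2.0, size=N)
A = U @ np.diag(s) @ V.T

theta2_true, th2, gam2 = 50.0, 30.0, 0.7    # true precision and estimates
x0 = rng.standard_normal(N)
w = rng.standard_normal(N) / np.sqrt(theta2_true)
y = A @ x0 + w
r2 = x0 + 0.5 * rng.standard_normal(N)      # some LMMSE-stage input

# Direct computation, as in line 18 of Algorithm 1.
Q = th2 * A.T @ A + gam2 * np.eye(N)
Qinv = np.linalg.inv(Q)
x2 = Qinv @ (th2 * A.T @ y + gam2 * r2)
inv_th2_next = (np.sum((y - A @ x2) ** 2) + np.trace(A @ Qinv @ A.T)) / N

# Transformed-statistic form (11)-(12), evaluated componentwise.
q = V.T @ (x0 - r2)                          # sign convention (see text)
xi = U.T @ w
phi2 = (gam2**2 * (s * q + xi)**2 / (s**2 * th2 + gam2)**2
        + s**2 / (s**2 * th2 + gam2))
mu2 = phi2.mean()                            # so theta2_next = 1 / mu2
```

The two quantities agree to floating-point precision, which is the content of (11): the vector update of line 18 is an empirical average of a per-component statistic in the transformed coordinates.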
The noise precision θ2 does not change with N.

Identical to [9], the transform A is modeled as a large, right-orthogonally invariant random matrix. Specifically, we assume that it has an SVD of the form (9), where U and V are N × N orthogonal matrices such that U is deterministic and V is Haar distributed (i.e., uniformly distributed on the set of orthogonal matrices). As described in [9], although we have assumed a square matrix A, we can consider general rectangular A by adding zero singular values.

Using the definitions in the full paper [27], we assume that the components of the singular-value vector s ∈ R^N in (9) converge empirically with second-order moments as

lim_{N→∞} {sn} =^{PL(2)} S,   (14)

for some non-negative random variable S with E[S] > 0 and S ∈ [0, Smax] for some finite maximum value Smax. Additionally, we assume that the components of the true vector x0 and of the initial input to the denoiser r10 converge empirically as

lim_{N→∞} {(r10,n, x0_n)} =^{PL(2)} (R10, X0),   R10 = X0 + P0,   P0 ~ N(0, τ10),   (15)

where X0 is a random variable representing the true distribution of the components of x0, P0 is an initial error and τ10 is an initial error variance. The variable X0 may be distributed as X0 ~ p(·|θ1) for some true parameter θ1. However, in order to incorporate under-modeling, the existence of such a true parameter is not required. We also assume that the initial second-order term and parameter estimates converge almost surely as

lim_{N→∞} (γ10, θ̂10, θ̂20) = (γ̄10, θ̄10, θ̄20),   (16)

for some γ̄10 > 0 and (θ̄10, θ̄20).

3.2 Error and Sensitivity Functions

We next need to introduce parametric forms of two key terms from [9]: error functions and sensitivity functions. The error functions describe the MSE of the denoiser and of the output estimator under AWGN measurements. Specifically, for the denoiser g1(·, γ1, θ̂1), we define the error function as

E1(γ1, τ1, θ̂1) := E[ (g1(R1, γ1, θ̂1) - X0)² ],   R1 = X0 + P,   P ~ N(0, τ1),   (17)

where X0 is distributed according to the true distribution of the components of x0 (see above). The function E1(γ1, τ1, θ̂1) thus represents the MSE of the estimate X̂ = g1(R1, γ1, θ̂1) from a measurement R1 corrupted by Gaussian noise of variance τ1 under the parameter estimate θ̂1. For the output estimator, we define the error function as

E2(γ2, τ2, θ̂2) := lim_{N→∞} (1/N) E ‖g2(r2, γ2, θ̂2) - x0‖²,
x0 = r2 + q,   q ~ N(0, τ2 I),   y = Ax0 + w,   w ~ N(0, θ2^{-1} I),   (18)

which is the average per-component error of the vector estimate under Gaussian noise. The dependence on the true noise precision θ2 is suppressed.

The sensitivity functions describe the expected divergence of the estimator. For the denoiser, the sensitivity function is defined as

A1(γ1, τ1, θ̂1) := E[ g1'(R1, γ1, θ̂1) ],   R1 = X0 + P,   P ~ N(0, τ1),   (19)

which is the average derivative under a Gaussian noise input. For the output estimator, the sensitivity is defined as

A2(γ2, τ2, θ̂2) := lim_{N→∞} (1/N) tr[ ∂g2(r2, γ2, θ̂2)/∂r2 ],   (20)

where r2 is distributed as in (18). The paper [9] discusses the error and sensitivity functions in detail and shows how these functions can be easily evaluated.

3.3 State Evolution Equations

For a given iteration k ≥ 1, consider the set of components {(x̂1k,n, r1k,n, x0_n), n = 1, . . . 
, N}.

This set collects the components of the true vector x0, its corresponding estimate x̂1k and the denoiser input r1k. We can now describe our main result: the SE equations for Adaptive VAMP, which extend those of the VAMP paper [9] with modifications for the parameter estimation. We will show that, under certain assumptions, these components converge empirically as

lim_{N→∞} {(x̂1k,n, r1k,n, x0_n)} =^{PL(2)} (X̂1k, R1k, X0),   (21)

where the random variables (X̂1k, R1k, X0) are given by

R1k = X0 + Pk,   Pk ~ N(0, τ1k),   X̂1k = g1(R1k, γ̄1k, θ̄1k),   (22)

for constants γ̄1k, θ̄1k and τ1k that will be defined below. We will also see that θ̂1k → θ̄1k, so θ̄1k represents the asymptotic parameter estimate. The model (22) shows that each component r1k,n appears as the true component x0_n plus Gaussian noise. The corresponding estimate x̂1k,n then appears as the denoiser output with r1k,n as the input and θ̄1k as the parameter estimate. Hence, the asymptotic behavior of any component x0_n and its corresponding estimate x̂1k,n is identical to that of a simple scalar system. We will refer to (21)-(22) as the denoiser's scalar equivalent model.

We will also show that the transformed errors qk and noise ξ in (10) and the singular values s converge empirically to a set of independent random variables (Qk, Ξ, S) given by

lim_{N→∞} {(qk,n, ξn, sn)} =^{PL(2)} (Qk, Ξ, S),   Qk ~ N(0, τ2k),   Ξ ~ N(0, θ2^{-1}),   (23)

where S has the distribution of the singular values of A, τ2k is a variance that will be defined below, and θ2 is the true noise precision in the measurement model (13). All the variables in (23) are independent. 
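The scalar equivalent models, together with the SE recursion (24) below, can be simulated directly. The following sketch (our code and our closed forms, with the prior and noise parameters held fixed and matched, so the Ti-updates are omitted) evaluates A1 and E1 by Monte Carlo for a Bernoulli-Gaussian prior, and A2 and E2 from a sampled singular-value spectrum.

```python
import numpy as np

def g1(r, gam, beta=0.1, sigx2=1.0):
    """MMSE denoiser for a Bernoulli-Gaussian prior under N(0, 1/gam) noise."""
    tau = 1.0 / gam
    like1 = np.exp(-0.5 * r**2 / (sigx2 + tau)) / np.sqrt(sigx2 + tau)
    like0 = np.exp(-0.5 * r**2 / tau) / np.sqrt(tau)
    p = beta * like1 / (beta * like1 + (1.0 - beta) * like0)
    return p * sigx2 / (sigx2 + tau) * r

rng = np.random.default_rng(0)
beta, th2 = 0.1, 100.0                        # sparsity, true noise precision
s = np.linalg.svd(rng.standard_normal((400, 400)) / 20.0, compute_uv=False)
x = rng.standard_normal(100000) * (rng.random(100000) < beta)  # X0 samples

tau10 = np.mean(x**2)                         # initial error variance
tau1, gam1 = tau10, 1.0 / tau10
for _ in range(15):
    r = x + np.sqrt(tau1) * rng.standard_normal(x.size)   # scalar model (22)
    d = 1e-4                                  # numeric derivative for A1
    alf1 = np.mean((g1(r + d, gam1) - g1(r - d, gam1)) / (2 * d))  # A1
    e1 = np.mean((g1(r, gam1) - x)**2)                             # E1
    gam2 = max(gam1 / alf1 - gam1, 1e-8)                           # (24a)
    tau2 = max((e1 - alf1**2 * tau1) / (1 - alf1)**2, 1e-12)       # (24c)
    alf2 = np.mean(gam2 / (th2 * s**2 + gam2))                     # A2
    e2 = np.mean((th2 * s**2 + gam2**2 * tau2)
                 / (th2 * s**2 + gam2)**2)                         # E2, matched
    gam1 = max(gam2 / alf2 - gam2, 1e-8)                           # (24d)
    tau1 = max((e2 - alf2**2 * tau2) / (1 - alf2)**2, 1e-12)       # (24f)
```

In this matched, high-SNR setting the predicted denoiser-input error variance τ1k decreases rapidly over the iterations, mirroring the per-iteration MSE trajectory that the SE predicts for the full vector algorithm.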
Thus (23) is a scalar equivalent model for the output estimator.

The variance terms are defined recursively through the state evolution equations

ᾱ1k = A1(γ̄1k, τ1k, θ̄1k),   η̄1k = γ̄1k/ᾱ1k,   γ̄2k = η̄1k - γ̄1k,   (24a)
θ̄1,k+1 = T1(μ̄1k),   μ̄1k = E[φ1(R1k, γ̄1k, θ̄1k)],   (24b)
τ2k = (1/(1 - ᾱ1k)²) [ E1(γ̄1k, τ1k, θ̄1k) - ᾱ1k² τ1k ],   (24c)
ᾱ2k = A2(γ̄2k, τ2k, θ̄2k),   η̄2k = γ̄2k/ᾱ2k,   γ̄1,k+1 = η̄2k - γ̄2k,   (24d)
θ̄2,k+1 = T2(μ̄2k),   μ̄2k = E[φ2(Qk, Ξ, S, γ̄2k, θ̄2k)],   (24e)
τ1,k+1 = (1/(1 - ᾱ2k)²) [ E2(γ̄2k, τ2k) - ᾱ2k² τ2k ],   (24f)

which are initialized with τ10 = E[(R10 - X0)²] and the (γ̄10, θ̄10, θ̄20) defined from the limit (16). The expectation in (24b) is with respect to the random variables in (21) and the expectation in (24e) is with respect to the random variables in (23).

Theorem 1. Consider the outputs of Algorithm 1. Under the above assumptions and definitions, assume additionally that for all iterations k:

(i) The solution ᾱ1k from the SE equations (24) satisfies ᾱ1k ∈ (0, 1).

(ii) The functions Ai(·), Ei(·) and Ti(·) are continuous at (γi, τi, θ̂i, μi) = (γ̄ik, τik, θ̄ik, μ̄ik).

(iii) The denoiser function g1(r1, γ1, θ̂1) and its derivative g1'(r1, γ1, θ̂1) are uniformly Lipschitz in r1 at (γ1, θ̂1) = (γ̄1k, θ̄1k). (See the full paper [27] 
for a precise definition of uniform Lipschitz continuity.)

(iv) The adaptation statistic φ1(r1, γ1, θ̂1) is uniformly pseudo-Lipschitz of order 2 in r1 at (γ1, θ̂1) = (γ̄1k, θ̄1k).

Then, for any fixed iteration k ≥ 0,

lim_{N→∞} (αik, ηik, γik, μik, θ̂ik) = (ᾱik, η̄ik, γ̄ik, μ̄ik, θ̄ik)   (25)

almost surely. In addition, the empirical limit (21) holds almost surely for all k > 0, and (23) holds almost surely for all k ≥ 0.

Theorem 1 shows that, in the LSL, the parameter estimates θ̂ik converge to deterministic limits θ̄ik that can be precisely predicted by the state-evolution equations. The SE equations incorporate the true distribution of the components of x0, the true noise precision θ2, and the specific parameter estimation and denoiser functions used by the Adaptive VAMP method. In addition, similar to the SE analysis of VAMP in [9], the SE equations also predict the asymptotic joint distribution of x0 and its estimates x̂ik. This joint distribution can be used to measure various performance metrics such as MSE; see [9]. In this way, we have provided a rigorous and precise characterization of a class of adaptive VAMP algorithms that includes EM-VAMP.

4 Consistent Parameter Estimation with Variance Auto-Tuning

By comparing the deterministic limits θ̄ik with the true parameters θi, one can determine under which problem conditions the parameter estimates of adaptive VAMP are asymptotically consistent. In this section, we show that, with a particular choice of parameter estimation functions, one can obtain provably asymptotically consistent parameter estimates under suitable identifiability conditions. We call the method variance auto-tuning; it generalizes the approach in [7].

Definition 1. 
Let p(x|θ1) be a parametrized set of densities. Given a finite-dimensional statistic φ1(r), consider the mapping

(τ1, θ1) ↦ E[φ1(R) | τ1, θ1],   R = X + N(0, τ1),   X ~ p(x|θ1).   (26)

We say that p(x|θ1) is identifiable in Gaussian noise if there exists a finite-dimensional statistic φ1(r) ∈ R^d such that (i) φ1(r) is pseudo-Lipschitz continuous of order 2; and (ii) the mapping (26) has a continuous inverse.

Theorem 2. Under the assumptions of Theorem 1, suppose that X0 ~ p(x|θ1⁰) for some true parameter θ1⁰. If p(x|θ1) is identifiable in Gaussian noise, there exists an adaptation rule such that, for any iteration k, the estimate θ̂1k and noise estimate τ̂1k are asymptotically consistent in that lim_{N→∞} θ̂1k = θ1⁰ and lim_{N→∞} τ̂1k = τ1k almost surely.

The theorem is proved in the full paper [27], which also provides details on how to perform the adaptation. A similar result for consistent estimation of the noise precision θ2 is also given. The result is remarkable, as it shows that a simple variant of EM-VAMP can provide provably consistent parameter estimates under extremely general distributions.

5 Numerical Simulations

Sparse signal recovery: The paper [8] presented several numerical experiments to assess the performance of EM-VAMP relative to other methods. Here, our goal is to confirm that EM-VAMP's performance matches the SE predictions. As in [8], we consider a sparse linear regression problem of estimating a vector x from measurements y from (1) without knowing the signal parameters θ1 or the noise precision θ2 > 0. Details are given in the full paper [27]. Briefly, to model the sparsity, x is drawn from an i.i.d. 
Bernoulli-Gaussian (i.e., spike-and-slab) prior with unknown sparsity level, mean and variance. The true sparsity is βx = 0.1. Following [15, 16], we take A ∈ R^{M×N} to be a random right-orthogonally invariant matrix with dimensions M = 512, N = 1024 and with the condition number set to κ = 100 (high condition number matrices are known to be problematic for conventional AMP methods). The left panel of Fig. 1 shows the normalized mean square error (NMSE) for various algorithms. The full paper [27] describes the algorithms in detail and also shows similar results for κ = 10.

Figure 1: Numerical simulations. Left panel: sparse signal recovery: NMSE versus iteration for a random matrix with condition number κ = 100. Right panel: NMSE for sparse image recovery as a function of the measurement ratio M/N.

We see several important features. First, for all variants of VAMP and EM-VAMP, the SE equations provide an excellent prediction of the per-iteration performance of the algorithm. Second, consistent with the simulations in [9], the oracle VAMP converges remarkably fast (~ 10 iterations). Third, the performance of EM-VAMP with auto-tuning is virtually indistinguishable from oracle VAMP, suggesting that the parameter estimates are near perfect from the very first iteration. Fourth, the EM-VAMP method performs initially worse than the oracle VAMP, but these errors are exactly predicted by the SE. Finally, all the VAMP and EM-VAMP algorithms exhibit much faster convergence than EM-BG-AMP. In fact, consistent with observations in [8], EM-BG-AMP begins to diverge at higher condition numbers. In contrast, the VAMP algorithms are stable.

Compressed sensing image recovery: While the theory is developed for theoretical signal priors, we demonstrate that the proposed EM-VAMP algorithm can be effective on natural images. Specifically, we repeat the experiments in [28] for recovery of a sparse image. Again, see the full paper [27] for details, including a picture of the image and the various reconstructions. An N = 256 × 256 image of a satellite with K = 6678 pixels is transformed through an undersampled random transform A = diag(s)PH, where H is a fast Hadamard transform, P is a random subselection to M measurements and s is a scaling to adjust the condition number. As in the previous example, the image vector x is modeled with a sparse Bernoulli-Gaussian prior, and the EM-VAMP algorithm is used to estimate the sparsity ratio, the signal variance and the noise variance. The transform is set to have a condition number of κ = 100. We see from the right panel of Fig. 1 that the EM-VAMP algorithm is able to reconstruct the image with improved performance over the standard basis-pursuit denoising method spgl1 [29] and the EM-BG-GAMP method from [16].

6 Conclusions

Due to its analytic tractability, computational simplicity, and potential for Bayes-optimal inference, VAMP is a promising technique for statistical linear inverse problems. However, a key challenge in using VAMP and related methods is the need to precisely specify the distribution on the problem parameters. This work provides a rigorous foundation for analyzing VAMP in combination with various parameter adaptation techniques, including EM. The analysis reveals that VAMP, with appropriate tuning, can also provide consistent parameter estimates under very general settings, thus yielding a powerful approach for statistical linear inverse problems.

Acknowledgments

A. K. Fletcher and M. Sahraee-Ardakan were supported in part by the National Science Foundation under Grants 1254204 and 1738286 and the Office of Naval Research under Grant N00014-15-1-2677. S. 
Rangan was supported in part by the National Science Foundation under Grants 1116589, 1302336, and 1547332, and the industrial affiliates of NYU WIRELESS. The work of P. Schniter was supported in part by the National Science Foundation under Grant CCF-1527162.

References

[1] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.

[2] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in Proc. IEEE Int. Symp. Inform. Theory, Saint Petersburg, Russia, Jul.–Aug. 2011, pp. 2174–2178.

[3] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Trans. Inform. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011.

[4] A. Javanmard and A. Montanari, “State evolution for general approximate message passing algorithms, with applications to spatial coupling,” Information and Inference, vol. 2, no. 2, pp. 115–144, 2013.

[5] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová, “Statistical-physics-based reconstruction in compressed sensing,” Physical Review X, vol. 2, no. 2, p. 021005, 2012.

[6] J. P. Vila and P. Schniter, “Expectation-maximization Gaussian-mixture approximate message passing,” IEEE Trans. Signal Processing, vol. 61, no. 19, pp. 4658–4672, 2013.

[7] U. S. Kamilov, S. Rangan, A. K. Fletcher, and M. Unser, “Approximate message passing with consistent parameter estimation and applications to sparse learning,” IEEE Trans. Info. Theory, vol. 60, no. 5, pp. 2969–2985, Apr. 2014.

[8] A. K. Fletcher and P. Schniter, “Learning and free energies for vector approximate message passing,” in Proc. IEEE ICASSP, Mar. 2017.

[9] S. Rangan, P. Schniter, and A. K.
Fletcher, “Vector approximate message passing,” in Proc. IEEE ISIT, June 2017.

[10] M. Seeger, “Bayesian inference and optimal design for the sparse linear model,” J. Machine Learning Research, vol. 9, pp. 759–813, Sep. 2008.

[11] M. W. Seeger and H. Nickisch, “Fast convergent algorithms for expectation propagation approximate Bayesian inference,” in Proc. International Conference on Artificial Intelligence and Statistics, 2011, pp. 652–660.

[12] M. Opper and O. Winther, “Expectation consistent free energies for approximate inference,” in Proc. NIPS, 2004, pp. 1001–1008.

[13] M. Opper and O. Winther, “Expectation consistent approximate inference,” J. Mach. Learning Res., vol. 6, pp. 2177–2204, 2005.

[14] A. K. Fletcher, M. Sahraee-Ardakan, S. Rangan, and P. Schniter, “Expectation consistent approximate inference: Generalizations and convergence,” in Proc. IEEE ISIT, 2016, pp. 190–194.

[15] S. Rangan, P. Schniter, and A. Fletcher, “On the convergence of approximate message passing with arbitrary matrices,” in Proc. IEEE ISIT, Jul. 2014, pp. 236–240.

[16] J. Vila, P. Schniter, S. Rangan, F. Krzakala, and L. Zdeborová, “Adaptive damping and mean removal for the generalized approximate message passing algorithm,” in Proc. IEEE ICASSP, 2015, pp. 2021–2025.

[17] A. Manoel, F. Krzakala, E. W. Tramel, and L. Zdeborová, “Swept approximate message passing for sparse estimation,” in Proc. ICML, 2015, pp. 1123–1132.

[18] S. Rangan, A. K. Fletcher, P. Schniter, and U. S. Kamilov, “Inference for generalized linear models via alternating directions and Bethe free energy minimization,” IEEE Transactions on Information Theory, vol. 63, no. 1, pp. 676–697, 2017.

[19] K.
Takeuchi, “Rigorous dynamics of expectation-propagation-based signal recovery from unitarily invariant measurements,” in Proc. IEEE ISIT, June 2017.

[20] S. Rangan, A. Fletcher, and V. K. Goyal, “Asymptotic analysis of MAP estimation via the replica method and applications to compressed sensing,” IEEE Trans. Inform. Theory, vol. 58, no. 3, pp. 1902–1923, Mar. 2012.

[21] A. M. Tulino, G. Caire, S. Verdú, and S. Shamai, “Support recovery with sparsely sampled free random matrices,” IEEE Trans. Inform. Theory, vol. 59, no. 7, pp. 4243–4271, 2013.

[22] J. Barbier, M. Dia, N. Macris, and F. Krzakala, “The mutual information in random linear estimation,” arXiv:1607.02335, 2016.

[23] G. Reeves and H. D. Pfister, “The replica-symmetric prediction for compressed sensing with Gaussian matrices is exact,” in Proc. IEEE ISIT, 2016.

[24] T. Heskes, O. Zoeter, and W. Wiegerinck, “Approximate expectation maximization,” in Proc. NIPS, 2004, pp. 353–360.

[25] A. K. Fletcher, S. Rangan, L. Varshney, and A. Bhargava, “Neural reconstruction with approximate message passing (NeuRAMP),” in Proc. Neural Information Processing Systems, Granada, Spain, Dec. 2011, pp. 2555–2563.

[26] A. K. Fletcher and S. Rangan, “Scalable inference for neuronal connectivity from calcium imaging,” in Proc. Neural Information Processing Systems, 2014, pp. 2843–2851.

[27] A. Fletcher, M. Sahraee-Ardakan, S. Rangan, and P. Schniter, “Rigorous dynamics and consistent estimation in arbitrarily conditioned linear systems,” arXiv preprint, 2017.

[28] J. P. Vila and P. Schniter, “An empirical-Bayes approach to recovering linearly constrained non-negative sparse signals,” IEEE Trans. Signal Process., vol. 62, no. 18, pp. 4689–4703, 2014.

[29] E. van den Berg and M. P.
Friedlander, “Probing the Pareto frontier for basis pursuit solutions,” SIAM Journal on Scientific Computing, vol. 31, no. 2, pp. 890–912, 2008.