{"title": "A Bayes Rule for Density Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 1457, "page_last": 1464, "abstract": null, "full_text": "A Bayes Rule for Density Matrices

Manfred K. Warmuth
Computer Science Department, University of California at Santa Cruz
manfred@cse.ucsc.edu

Abstract

The classical Bayes rule computes the posterior model probability from the prior probability and the data likelihood. We generalize this rule to the case when the prior is a density matrix (symmetric positive definite and trace one) and the data likelihood a covariance matrix. The classical Bayes rule is retained as the special case when the matrices are diagonal. In the classical setting, the calculation of the probability of the data is an expected likelihood, where the expectation is over the prior distribution. In the generalized setting, this is replaced by an expected variance calculation, where the variance is computed along the eigenvectors of the prior density matrix and the expectation is over the eigenvalues of the density matrix (which form a probability vector). The variance along any direction is determined by the covariance matrix. Curiously enough, this expected variance calculation is a quantum measurement where the covariance matrix specifies the instrument and the prior density matrix the mixture state of the particle. We motivate both the classical and the generalized Bayes rule with a minimum relative entropy principle, where the Kullback-Leibler version gives the classical Bayes rule and Umegaki's quantum relative entropy the new Bayes rule for density matrices.

1 Introduction

In [TRW05] various on-line updates were generalized from vector parameters to matrix parameters. Following [KW97], the updates were derived by minimizing the loss plus a divergence to the last parameter. In this paper we use the same method for deriving a Bayes rule for density matrices (symmetric positive definite matrices of trace one). When the parameters are probability vectors over the set of models, the "classical" Bayes rule can be derived using the relative entropy as the divergence (see e.g. [KW99, SWRL03]). Analogously, we now use the quantum relative entropy, introduced by Umegaki, to derive the generalized Bayes rule.

* Supported by NSF grant CCR 9821087. Some of this work was done while visiting National ICT Australia in Canberra.

Figure 1: We update the prior four times based on the same data likelihood vector P(y|Mi). The initial posteriors are close to the prior, but eventually the posteriors focus their weight on argmax_i P(y|Mi). The classical Bayes rule may be seen as a soft maximum calculation.

Figure 2: We depict seven iterations of the generalized Bayes rule with the bold NW-SE ellipse as the prior density and the bold dashed SE-NW ellipse as the data covariance matrix. The posterior density matrices (dashed) gradually move from the prior to the longest axis of the covariance matrix.

The new rule uses matrix logarithms and exponentials to avoid the fact that symmetric positive definite matrices are not closed under the matrix product. The rule is strikingly similar to the classical Bayes rule and retains the latter as a special case when the matrices are diagonal. Various cancellations occur when the classical Bayes rule is applied iteratively, and similar cancellations happen with the new rule. We shall see that the classical Bayes rule may be seen as a soft maximum calculation and the new rule as a soft calculation of the eigenvector with the largest eigenvalue (see Figures 1 and 2).
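As a small illustration of the soft-maximum behavior in Figure 1, the following sketch (not from the original paper; the prior and the likelihood vector are made-up numbers) applies the classical Bayes rule of Section 2 four times with the same likelihood vector:

```python
import numpy as np

# Made-up prior over four models and a fixed data likelihood vector P(y|Mi).
prior = np.array([0.25, 0.25, 0.25, 0.25])
likelihood = np.array([0.1, 0.3, 0.4, 0.2])

posterior = prior.copy()
for t in range(4):
    # Classical Bayes rule (1): posterior is proportional to prior times likelihood.
    posterior = posterior * likelihood
    posterior /= posterior.sum()        # normalize by the expected likelihood P(y)
    print(t + 1, np.round(posterior, 3))
# The weight gradually concentrates on argmax_i P(y|Mi) (the third model here),
# which is why the rule may be seen as a soft maximum calculation.
```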
The mathematics applied in this paper is most commonly used in quantum physics. For example, the data likelihood becomes a quantum measurement. It is tempting to call the new rule the "quantum Bayes rule". However, we have no physical interpretation of this rule. The measurement does not collapse our state and we do not use the unitary evolution of a state to model the rule. Also, the term "quantum Bayes rule" has been claimed before in [SBC01], where the classical Bayes rule is used to update probabilities that happen to arise in the context of quantum physics. In contrast, in this paper our parameters are density matrices. Our work is most closely related to a paper by Cerf and Adami [CA99], who also give a formula for conditional densities that relies on the matrix exponential and logarithm. However, they are interested in the multivariate case (which requires the use of tensors) and their motivation is to obtain a generalization of a conditional quantum entropy. We hope to build on the great body of work done with the classical Bayes rule in the statistics community and therefore believe that this line of research holds great promise.

2 The Classical Bayes Rule

To establish a common notation we begin by introducing the familiar Bayes rule. Assume we have n models M1, ..., Mn. In the classical setup, model Mi is chosen with prior probability P(Mi) and then Mi generates a datum y with probability P(y|Mi). After observing y, the posterior probability of model Mi is calculated via the Bayes rule:

P(Mi|y) = P(Mi) P(y|Mi) / Σ_j P(Mj) P(y|Mj).   (1)

Figure 3: An ellipse S in R^2: the eigenvectors are the directions of the axes and the eigenvalues their lengths. Ellipses are weighted combinations of the one-dimensional degenerate ellipses (dyads) corresponding to the axes. (For a unit vector u, the dyad uu^T is a degenerate one-dimensional ellipse with its single axis in direction u.) The solid curve of the ellipse is a plot of Su and the outer dashed figure eight is the direction u times the variance u^T S u. At the eigenvectors, this variance equals the eigenvalues and touches the ellipse.

Figure 4: When the ellipses S and T do not have the same span, then S ⊙ T lies in the intersection of both spans and is a degenerate ellipse of dimension one (bold line). This generalizes the following intersection property of the matrix product when S and T are both diagonal (here of dimension four): an entry of diag(ST) is non-zero exactly when the corresponding entries of diag(S) and diag(T) are both non-zero; for instance, entries a and b sharing a position multiply to ab, while a position that is zero in either factor is zero in the product.

See Figure 1 for a bar plot of the effect of the update on the posterior. By the Theorem of Total Probability, the expected likelihood in the denominator equals P(y). In a moment we will replace this expected likelihood by an expected variance.

3 Density Matrices as Priors

We now let our prior D be an arbitrary symmetric positive^1 definite matrix of trace one. Such matrices are called density matrices in quantum physics. An outer product uu^T, where u has unit length, is called a dyad. Any mixture Σ_i α_i a_i a_i^T of dyads a_i a_i^T is a density matrix as long as the coefficients α_i are non-negative and sum to one. This is true even if the number of dyads is larger or smaller than the dimension of D. The trace of such a mixture is one because dyads have trace one and Σ_i α_i = 1.
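To make the dyad-mixture picture concrete, here is a short sketch (not from the original paper; the weights and directions are made up, and the helper name dyad is ours) that builds a density matrix as a mixture of dyads and checks the stated properties:

```python
import numpy as np

rng = np.random.default_rng(0)

def dyad(u):
    """Outer product u u^T of a unit vector u (a one-dimensional 'pure state')."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    return np.outer(u, u)

# A density matrix as a mixture of dyads; note that the number of dyads (four)
# need not match the dimension (three).
alphas = np.array([0.4, 0.3, 0.2, 0.1])          # non-negative, sum to one
dirs = rng.standard_normal((4, 3))               # four random directions in R^3
D = sum(a * dyad(u) for a, u in zip(alphas, dirs))

print(np.isclose(np.trace(D), 1.0))              # trace one: dyads have trace one
print(np.allclose(D, D.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(D) >= -1e-12))   # positive semidefinite

# A probability vector is the special case of a diagonal density matrix.
print(np.diag([0.2, 0.5, 0.3]))
```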
Of course, any density matrix D can be decomposed based on an eigensystem. That is, D = W Λ W^T, where W^T W = I and Λ is the diagonal matrix of eigenvalues. Now the vector of eigenvalues (λ_i) forms a probability vector whose dimension equals the dimension of the density matrix. In quantum physics, the dyads are called pure states and density matrices are mixtures over such states. Note that in this paper we want to address the statistics community and therefore use linear algebra notation instead of Dirac notation.

The probability vector (P(Mi)) can be represented as a diagonal matrix diag((P(Mi))) = Σ_i P(Mi) e_i e_i^T, where e_i denotes the ith standard basis vector. This means that probability vectors are special density matrices where the eigenvectors are fixed to the standard basis vectors.

^1 We use the convention that positive definite matrices have non-negative eigenvalues and strictly positive definite matrices have positive eigenvalues.

4 Covariance Matrices and Basic Notation

In this paper we replace the (conditional) data likelihoods P(y|Mi) by a data covariance matrix D(y|.) (a symmetric positive definite matrix). We now discuss such matrices in more detail. A covariance matrix S can be depicted as an ellipse {Su : ||u||_2 ≤ 1} centered at the origin, where the eigenvectors form the principal axes and the eigenvalues are the lengths of the axes (see Figure 3). Assume S is the covariance matrix of some random cost vector c in R^n, i.e. S = E((c - E(c))(c - E(c))^T). Note that a covariance matrix S is diagonal if the components of the cost vector are independent. The variance of the cost vector c along a unit vector u has the form

V(c^T u) = E((c^T u - E(c^T u))^2) = E((u^T (c - E(c)))^2) = u^T S u,

and the variance along an eigenvector is the corresponding eigenvalue (see Figure 3). Using this interpretation, the matrix S may be seen as a mapping S(.) from the unit ball to the non-negative reals, i.e. S(u) = u^T S u. A second interpretation of the scalar u^T S u is the squared length of u w.r.t. the basis S^{1/2}, that is u^T S u = u^T S^{1/2} S^{1/2} u = ||S^{1/2} u||_2^2. Thirdly, u^T S u is a quantum measurement of the pure state u with an instrument represented by S. Since the squared length of u w.r.t. any orthonormal basis is one, any such basis (s_i) turns the unit vector into an n-dimensional probability vector ((u^T s_i)^2)_i. If the s_i are the eigenvectors of S with eigenvalues σ_i, then u^T S u is the expected eigenvalue w.r.t. this probability vector: u^T S u = Σ_i σ_i (u^T s_i)^2.

The trace tr(A) of a square matrix A is the sum of its diagonal elements A_ii. Recall that tr(AB) = tr(BA) for any matrices A in R^{n×m}, B in R^{m×n}. The trace is unitarily invariant, i.e. for any orthogonal matrix U, tr(U^T A U) = tr(U U^T A) = tr(A). Also, tr(uu^T A) = tr(u^T A u) = u^T A u. Therefore the trace of a square matrix may be seen as the total variance along any set of orthogonal directions u_i: tr(A) = tr(IA) = tr(Σ_i u_i u_i^T A) = Σ_i u_i^T A u_i. In particular, the trace of a square matrix is the sum of its eigenvalues.

The matrix exponential exp(S) of the symmetric matrix S = W Λ W^T is defined as W exp(Λ) W^T, where exp(Λ) is obtained by exponentiating the diagonal entries (eigenvalues). The matrix logarithm log(S) is defined similarly, but now S must be strictly positive definite. Clearly, the two functions are inverses of each other. It is important to remember that exp(S + T) = exp(S) exp(T) holds if and only if the two symmetric matrices commute^2, i.e. ST = TS. However, the following trace inequality, known as the Golden-Thompson inequality [Bha97], always holds:

tr(exp(S) exp(T)) ≥ tr(exp(S + T)).   (2)

^2 This occurs iff the two symmetric matrices have the same eigensystem.
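As a quick numerical illustration of the variance and trace facts above and of inequality (2), here is a short sketch (not from the original paper; the matrices are random stand-ins, and exp_sym is our helper implementing the eigendecomposition definition of the matrix exponential):

```python
import numpy as np

def exp_sym(S):
    """Matrix exponential of a symmetric matrix: exponentiate its eigenvalues."""
    lam, W = np.linalg.eigh(S)
    return W @ np.diag(np.exp(lam)) @ W.T

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); S = A @ A.T     # made-up covariance matrices
B = rng.standard_normal((3, 3)); T = B @ B.T

# Variance along a unit direction u is u^T S u, and the trace is the total
# variance along any orthonormal set of directions (= sum of eigenvalues).
u = rng.standard_normal(3); u /= np.linalg.norm(u)
print(u @ S @ u)
print(np.isclose(np.trace(S), np.linalg.eigvalsh(S).sum()))

# Golden-Thompson inequality (2): tr(exp(S) exp(T)) >= tr(exp(S + T)),
# with equality when S and T commute.
print(np.trace(exp_sym(S) @ exp_sym(T)) >= np.trace(exp_sym(S + T)))
```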
5 The Generalized Bayes Rule

The following experiment underlies the more general setup: if the prior is D(.) = Σ_i λ_i d_i d_i^T, then the dyad (or pure state) d_i d_i^T is chosen with probability λ_i and a random variable c^T d_i is observed, where c has covariance matrix D(y|.).

In our generalization we replace the expected data likelihood P(y) = Σ_i P(Mi) P(y|Mi) by the following trace:

tr(D(.) D(y|.)) = tr(Σ_i λ_i d_i d_i^T D(y|.)) = Σ_i λ_i d_i^T D(y|.) d_i.

Recall that d_i^T D(y|.) d_i is the variance of c in direction d_i, i.e. V(c^T d_i). Therefore the above trace is the expected variance along the eigenvectors of the density matrix, weighted by the eigenvalues. Curiously enough, this trace computation is a quantum measurement, where D(y|.) represents the instrument and D(.) the mixture state of the particle.

In the generalized Bayes rule we cannot simply multiply the prior density matrix with the covariance matrix that corresponds to the data likelihood. This is because a product of two symmetric positive definite matrices may be neither symmetric nor positive definite. Instead we define the operation ⊙ on the cone of symmetric positive definite matrices. We begin by defining this operation for the case when the matrices S and T are strictly positive definite (and symmetric):

S ⊙ T := exp(log S + log T).   (3)

The matrix log of both matrices produces symmetric matrices that sum to a symmetric matrix. Finally, the matrix exponential of the sum again produces a symmetric positive definite matrix. Note that the matrix log is not defined when the matrix has a zero eigenvalue. However, for arbitrary symmetric positive definite matrices one can define the operation as the following limit:

S ⊙ T := lim_{n→∞} (S^{1/n} T^{1/n})^n.

This limit is the Lie Product Formula [Bha97] when S and T are both strictly positive definite, but it exists even if the matrices do not have full rank and, by Theorem 1.2 of [Sim79], range(S ⊙ T) = range(S) ∩ range(T). Assume that k is the dimension of range(S) ∩ range(T), that B is an orthonormal basis of range(S) ∩ range(T) (i.e. B in R^{n×k}, B^T B = I_k, and range(B) = range(S) ∩ range(T)), and that log+ denotes the modified matrix logarithm that takes logs of the non-zero eigenvalues but leaves zero eigenvalues unchanged. Then, by the same theorem^3,

S ⊙ T = B exp(B^T (log+ S + log+ T) B) B^T.   (4)

When both matrices have the same eigensystem, ⊙ becomes the matrix product. One can show that ⊙ is associative, commutative, and has the identity matrix I as its neutral element, and that for any strictly positive definite and symmetric matrix S, S ⊙ S^{-1} = I. Finally, (cS) ⊙ T = c (S ⊙ T) for any non-negative scalar c.

Using this new product operation, the generalized Bayes rule becomes:

D(.|y) = D(.) ⊙ D(y|.) / tr(D(.) ⊙ D(y|.)).   (5)

Normalizing by the trace assures that the trace of the posterior density matrix is one. As we see in Figure 2, this posterior moves toward the largest axis of the data covariance matrix, and the new rule can be interpreted as a soft calculation of the eigenvector with the maximum eigenvalue.

^3 The log+ S term in the formula can be replaced by B̃ log(B̃^T S B̃) B̃^T, where B̃ is an orthonormal basis of range(S), and similarly for log+ T.
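The following sketch (not from the original paper; the function names and all numbers are ours) implements the ⊙ operation of (3) for strictly positive definite matrices via the eigendecomposition-based matrix log and exp, and applies the generalized Bayes rule (5). The diagonal example checks that the rule reduces to the classical rule (1); the second example iterates the update as in Figure 2 with a made-up prior and covariance matrix:

```python
import numpy as np

def _fn_sym(S, f):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    lam, W = np.linalg.eigh(S)
    return W @ np.diag(f(lam)) @ W.T

def circ(S, T):
    """S ⊙ T = exp(log S + log T), eq. (3); S and T strictly positive definite."""
    return _fn_sym(_fn_sym(S, np.log) + _fn_sym(T, np.log), np.exp)

def generalized_bayes(D_prior, D_like):
    """Generalized Bayes rule (5): normalize D(.) ⊙ D(y|.) to trace one."""
    P = circ(D_prior, D_like)
    return P / np.trace(P)

# Diagonal case: the new rule realizes the classical Bayes rule (made-up numbers).
prior = np.array([0.2, 0.5, 0.3])
like = np.array([0.4, 0.1, 0.5])
print(np.diag(generalized_bayes(np.diag(prior), np.diag(like))))
print(prior * like / (prior @ like))          # the same probability vector

# Non-diagonal case: iterating the update drives the posterior toward the dyad of
# the covariance matrix's top eigenvector, a "soft" top-eigenvector calculation.
D = np.eye(2) / 2.0                           # prior: the circle
C = np.array([[1.5, 0.7],
              [0.7, 0.6]])                    # strictly positive definite covariance
for _ in range(7):
    D = generalized_bayes(D, C)
print(np.round(D, 3))
v = np.linalg.eigh(C)[1][:, -1]               # top eigenvector of C
print(np.round(np.outer(v, v), 3))            # the dyad the posterior approaches
```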
Figure 5: Assume the prior density matrix is the circle D(.) = (1/2) I and the data covariance matrix is the degenerate NE-SW ellipse D(y|.) = U diag(2, 0) U^T, where U = (1/√2) (1 1; 1 -1). Now for all diagonal density matrices S(.), tr(S(.) D(y|.)) = 1, i.e. the largest eigenvalue is not "visible" in basis I. But for the posterior of the new rule, D(.|y) = U diag(1, 0) U^T, we have tr(D(.|y) D(y|.)) = 2, the largest eigenvalue.

When the matrices D(.) and D(y|.) have the same eigensystem, ⊙ becomes matrix multiplication. In particular, when the prior is diag((P(Mi))) and the covariance matrix is diag((P(y|Mi))), then the new rule realizes the classical rule and computes diag((P(Mi|y))). Figure 5 gives an example that shows how the off-diagonal elements can be exploited by the new rule.

In the classical Bayes rule, the normalization factor is the expected data likelihood. In the case of the generalized Bayes rule, the expected variance only upper bounds the normalization factor via the Golden-Thompson inequality (2):

tr(D(.) D(y|.)) ≥ tr(D(.) ⊙ D(y|.)).   (6)

The classical Bayes rule can be applied iteratively to a sequence of data and various cancellations occur. For the sake of simplicity we only consider two data points y1, y2:

P(Mi|y2 y1) = P(Mi|y1) P(y2|Mi, y1) / P(y2|y1) = P(Mi) P(y1|Mi) P(y2|Mi, y1) / P(y2 y1),

where

P(y2|y1) P(y1) = (Σ_i P(Mi|y1) P(y2|Mi, y1)) (Σ_i P(Mi) P(y1|Mi)) = Σ_i P(Mi) P(y1|Mi) P(y2|Mi, y1) = P(y2 y1),   (7)

and the middle equality uses (1). Analogously,

D(.|y2 y1) = D(.|y1) ⊙ D(y2|., y1) / tr(D(.|y1) ⊙ D(y2|., y1)) = D(.) ⊙ D(y1|.) ⊙ D(y2|., y1) / tr(D(.) ⊙ D(y1|.) ⊙ D(y2|., y1)).

Finally, the products of the expected variances for both trials combine in a similar way, except that in the generalized case the equality becomes an inequality:

tr(D(.|y1) D(y2|., y1)) tr(D(.) D(y1|.)) ≥ tr(D(.|y1) ⊙ D(y2|., y1)) tr(D(.) ⊙ D(y1|.)) = tr(D(.) ⊙ D(y1|.) ⊙ D(y2|., y1)),

where the equality uses (5). The above inequality is an instantiation of the Golden-Thompson inequality (2) and the above equality generalizes the middle equality in (7).

6 The Derivation of the Generalized Bayes Rule

The classical Bayes rule can be derived^4 by minimizing a relative entropy to the prior plus a convex combination of the log losses of the models (see e.g. [KW99, SWRL03]):

inf_{ω_i ≥ 0, Σ_i ω_i = 1}  Σ_i ω_i ln (ω_i / P(Mi)) - Σ_i ω_i ln P(y|Mi).

Without the relative entropy, the argument of the infimum is linear in the weights ω_i and is minimized when all weight is placed on the maximum likelihood models, i.e. the set of indices argmax_i P(y|Mi). The negative entropy ameliorates the maximum calculation and pulls the optimal solution towards the prior. Observe that the non-negativity constraints can be dropped since the entropy acts as a barrier. By introducing a Lagrange multiplier for the remaining constraint and differentiating, we obtain the solution ω_i = P(Mi) P(y|Mi) / Σ_j P(Mj) P(y|Mj), which is the classical Bayes rule (1). By plugging this ω_i into the argument of the infimum we obtain the optimum value - ln P(y). Notice that this is minus the logarithm of the normalization of the Bayes rule (1) and is also the log loss associated with the standard Bayesian setup.
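As a small numerical check of this classical derivation (not from the original paper; the prior and likelihood numbers are made up), the Bayes posterior (1) attains the infimum and the optimum value - ln P(y):

```python
import numpy as np

prior = np.array([0.2, 0.5, 0.3])          # P(Mi), made up
like = np.array([0.4, 0.1, 0.5])           # P(y|Mi), made up

def objective(w):
    """Relative entropy to the prior plus the expected negative log likelihood."""
    return np.sum(w * np.log(w / prior)) - np.sum(w * np.log(like))

bayes = prior * like / (prior @ like)      # classical Bayes rule (1)
print(np.isclose(objective(bayes), -np.log(prior @ like)))   # optimum value is -ln P(y)

# No randomly sampled probability vector does better than the Bayes posterior.
rng = np.random.default_rng(2)
others = rng.dirichlet(np.ones(3), size=1000)
print(np.all(objective(bayes) <= np.array([objective(w) for w in others]) + 1e-12))
```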
To derive the new generalized Bayes rule in an analogous way, we use the quantum physics generalization of the relative entropy between two densities G and D (due to Umegaki), tr(G (log G - log D)). We also need to replace the mixture of negative log likelihoods by the trace -tr(G log D(y|.)). Now the matrix parameter G is constrained to be a density matrix and the minimization problem becomes^5:

inf_{G dens. matr.}  tr(G (log G - log D(.))) - tr(G log D(y|.)).

Except for the quantum relative entropy term, the argument of the infimum is again linear in the variable G and is minimized when G is a single dyad uu^T, where u is the eigenvector belonging to the maximum eigenvalue of the matrix log D(y|.). The linear term pulls G toward a direction of high variance of this matrix, whereas the quantum relative entropy pulls G toward the prior density matrix. The density matrix constraint requires the eigenvalues of G to be non-negative and the trace of G to be one. The entropy works as a barrier for the non-negativity constraints and thus these constraints can be dropped. Again, by introducing a Lagrange multiplier for the remaining trace constraint and differentiating (following [TRW05]), we arrive at a formula for the optimum G which coincides with the formula for D(.|y) given in the generalized Bayes rule (5), where ⊙ is defined^6 as in (3). Since the quantum relative entropy is strictly convex [NC00] in G, the optimum G is unique.

^4 For the sake of simplicity assume that all P(Mi) and P(y|Mi) are positive.
^5 Assume here that D(.) and D(y|.) are both strictly positive definite.
^6 With some work, one can also derive the Bayes rule with the fancier ⊙ operation (4).
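The derivation can also be checked numerically. The following sketch (not from the original paper; the matrices are random strictly positive definite stand-ins and the helper names are ours) evaluates the objective above at the posterior of rule (5) and at many random density matrices:

```python
import numpy as np

def _fn_sym(S, f):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    lam, W = np.linalg.eigh(S)
    return W @ np.diag(f(lam)) @ W.T

def objective(G, D_prior, D_like):
    """tr(G (log G - log D(.))) - tr(G log D(y|.)), all matrices strictly PD."""
    return (np.trace(G @ (_fn_sym(G, np.log) - _fn_sym(D_prior, np.log)))
            - np.trace(G @ _fn_sym(D_like, np.log)))

def generalized_bayes(D_prior, D_like):
    """Posterior of rule (5): exp(log D(.) + log D(y|.)), normalized to trace one."""
    P = _fn_sym(_fn_sym(D_prior, np.log) + _fn_sym(D_like, np.log), np.exp)
    return P / np.trace(P)

rng = np.random.default_rng(3)

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + 0.1 * np.eye(n)          # ridge keeps the matrix well conditioned

D_prior = random_spd(3); D_prior /= np.trace(D_prior)     # strictly PD density matrix
D_like = random_spd(3)                                    # strictly PD "covariance"

G_star = generalized_bayes(D_prior, D_like)
best = objective(G_star, D_prior, D_like)

# The objective is strictly convex in G, so no other density matrix should do better.
for _ in range(200):
    G = random_spd(3); G /= np.trace(G)
    assert objective(G, D_prior, D_like) >= best - 1e-8
print("rule (5) attains the minimum; value =", np.round(best, 4))
```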
7 Conclusion

Our generalized Bayes rule suggests a definition of conditional density matrices and we are currently developing a calculus for such matrices. In particular, a common formalism is needed that includes the multivariate conditional density matrices defined in [CA99] based on tensors.

In this paper we only considered real symmetric matrices. However, our methods immediately generalize to complex Hermitian matrices, i.e. square matrices in C^{n×n} that equal their conjugate transpose. Now both the prior density matrix and the data covariance matrix must be Hermitian instead of symmetric.

The generalized Bayes rule for symmetric positive definite matrices relies on computing eigendecompositions (O(n^3) time). Hopefully, there exist O(n^2) versions of the update that approximate the generalized Bayes rule sufficiently well.

Extensive research has been done in the so-called "expert framework" (see e.g. [KW99] for a list of references), where a mixture over experts is maintained by the on-line algorithm for the purpose of performing as well as the best expert chosen in hindsight. In preliminary research we showed that one can maintain a density matrix over the base experts instead and derive updates similar to the generalized Bayes rule given in this paper. Most importantly, the bounds generalize to the case when mixtures over experts are replaced by density matrices.

Acknowledgment: We would like to thank Dima Kuzmin for his extensive help with all aspects of this paper. Thanks also to Torsten Ehrhardt, who first proved to us the range intersection and projection properties of the ⊙ operation.

References

[Bha97] R. Bhatia. Matrix Analysis. Springer, Berlin, 1997.
[CA99] N. J. Cerf and C. Adami. Quantum extension of conditional probability. Physical Review A, 60(2):893-897, August 1999.
[KW97] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-64, January 1997.
[KW99] J. Kivinen and M. K. Warmuth. Averaging expert predictions. In Computational Learning Theory: 4th European Conference (EuroCOLT '99), pages 153-167, Berlin, March 1999. Springer.
[NC00] M. A. Nielsen and I. L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.
[SBC01] R. Schack, T. A. Brun, and C. M. Caves. Quantum Bayes rule. Physical Review A, 64:014305, 2001.
[Sim79] Barry Simon. Functional Integration and Quantum Physics. Academic Press, New York, 1979.
[SWRL03] R. Singh, M. K. Warmuth, B. Raj, and P. Lamere. Classification with free energy at raised temperatures. In Proc. of EUROSPEECH 2003, pages 1773-1776, September 2003.
[TRW05] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projections. Journal of Machine Learning Research, 6:995-1018, June 2005.
", "award": [], "sourceid": 2793, "authors": [{"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}