{"title": "Learning and Inference in Hilbert Space with Quantum Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 10338, "page_last": 10347, "abstract": "Quantum Graphical Models (QGMs) generalize classical graphical models by adopting the formalism for reasoning about uncertainty from quantum mechanics. Unlike classical graphical models, QGMs represent uncertainty with density matrices in complex Hilbert spaces. Hilbert space embeddings (HSEs) also generalize Bayesian inference in Hilbert spaces. We investigate the link between QGMs and HSEs and show that the sum rule and Bayes rule for QGMs are equivalent to the kernel sum rule in HSEs and a special case of Nadaraya-Watson kernel regression, respectively. We show that these operations can be kernelized, and use these insights to propose a Hilbert Space Embedding of Hidden Quantum Markov Models (HSE-HQMM) to model dynamics. We present experimental results showing that HSE-HQMMs are competitive with state-of-the-art models like LSTMs and PSRNNs on several datasets, while also providing a nonparametric method for maintaining a probability distribution over continuous-valued features.", "full_text": "Learning and Inference in Hilbert Space with\n\nQuantum Graphical Models\n\nSiddarth Srinivasan\nCollege of Computing\n\nGeorgia Tech\n\nAtlanta, GA 30332\n\nsidsrini@gatech.edu\n\nCarlton Downey\n\nDepartment of Machine Learning\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\ncmdowney@cs.cmu.edu\n\nByron Boots\n\nCollege of Computing\n\nGeorgia Tech\n\nAtlanta, GA 30332\n\nbboots@cc.gatech.edu\n\nAbstract\n\nQuantum Graphical Models (QGMs) generalize classical graphical models by\nadopting the formalism for reasoning about uncertainty from quantum mechanics.\nUnlike classical graphical models, QGMs represent uncertainty with density matri-\nces in complex Hilbert spaces. Hilbert space embeddings (HSEs) also generalize\nBayesian inference in Hilbert spaces. 
We investigate the link between QGMs\nand HSEs and show that the sum rule and Bayes rule for QGMs are equivalent\nto the kernel sum rule in HSEs and a special case of Nadaraya-Watson kernel\nregression, respectively. We show that these operations can be kernelized, and\nuse these insights to propose a Hilbert Space Embedding of Hidden Quantum\nMarkov Models (HSE-HQMM) to model dynamics. We present experimental\nresults showing that HSE-HQMMs are competitive with state-of-the-art models\nlike LSTMs and PSRNNs on several datasets, while also providing a nonparametric\nmethod for maintaining a probability distribution over continuous-valued features.\n\nIntroduction and Related Work\n\n1\nVarious formulations of Quantum Graphical Models (QGMs) have been proposed by researchers in\nphysics and machine learning [Srinivasan et al., 2018, Yeang, 2010, Leifer and Poulin, 2008] as a\nway of generalizing probabilistic inference on graphical models by adopting quantum mechanics\u2019\nformalism for reasoning about uncertainty. While Srinivasan et al. [2018] focused on modeling\ndynamical systems with Hidden Quantum Markov Models (HQMMs) [Monras et al., 2010], they\nalso describe the basic operations on general quantum graphical models, which generalize Bayesian\nreasoning within a framework consistent with quantum mechanical principles. Inference using Hilbert\nspace embeddings (HSE) is also a generalization of Bayesian reasoning, where data is mapped to a\nHilbert space in which kernel sum, chain, and Bayes rules can be used [Smola et al., 2007, Song et al.,\n2009, 2013]. These methods can model dynamical systems such as HSE-HMMs [Song et al., 2010],\nHSE-PSRs [Boots et al., 2012], and PSRNNs [Downey et al., 2017]. 
Schuld and Killoran [2018] present related but orthogonal work connecting kernels, Hilbert spaces, and quantum computing. Since quantum states live in complex Hilbert spaces, and both QGMs and HSEs generalize Bayesian reasoning, it is natural to ask: what is the relationship between quantum graphical models and Hilbert space embeddings? This is precisely the question we tackle in this paper. Overall, we present four contributions: (1) we show that the sum rule for QGMs is identical to the kernel sum rule for HSEs, while the Bayesian update in QGMs is equivalent to performing Nadaraya-Watson kernel regression, (2) we show how to kernelize these operations and argue that with the right choice of features, we are mapping our data to quantum systems and modeling dynamics as quantum state evolution, (3) we use these insights to propose an HSE-HQMM to model dynamics by mapping data to quantum systems and performing inference in Hilbert space, and, finally, (4) we present a learning algorithm and experimental results showing that HSE-HQMMs are competitive with other state-of-the-art methods for modeling sequences, while also providing a nonparametric method for estimating the distribution of continuous-valued features.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Quantum Graphical Models
2.1 Classical vs Quantum Probability
In classical discrete graphical models, an observer's uncertainty about a random variable X can be represented by a vector $\vec{x}$ whose entries give the probability of X being in various states. In quantum mechanics, we write the 'pure' quantum state of a particle A as $|\psi\rangle_A$, a complex-valued column vector in some orthonormal basis that lives in a Hilbert space, whose entries are 'probability amplitudes' of system states.
The squared norm of these probability amplitudes gives the probability of the corresponding system state, so the sum of squared norms of the entries must be 1. To describe 'mixed states', where we have a probabilistic mixture of quantum states (e.g., a mixture of N quantum systems, each with probability $p_i$), we use a Hermitian 'density matrix', defined as follows:

$$\hat{\rho} = \sum_{i}^{N} p_i |\psi_i\rangle\langle\psi_i| \qquad (1)$$

The diagonal entries of a density matrix give the probabilities of being in each system state, and off-diagonal elements represent quantum coherences, which have no classical interpretation. Consequently, the normalization condition is $\mathrm{tr}(\hat{\rho}) = 1$. Uncertainty about an $n$-state system is represented by an $n \times n$ density matrix. The density matrix is the quantum analogue of the classical belief $\vec{x}$.

2.2 Operations on Quantum Graphical Models
Here, we further develop the operations on QGMs introduced by Srinivasan et al. [2018], working with the notion that the density matrix is the quantum analogue of a classical belief state.

Joint Distributions  The joint distribution of an $n$-state variable A and $m$-state variable B can be written as an $nm \times nm$ 'joint density matrix' $\hat{\rho}_{AB}$. When A and B are independent, $\hat{\rho}_{AB} = \hat{\rho}_A \otimes \hat{\rho}_B$. As a valid density matrix, the diagonal elements represent probabilities corresponding to the states in the Cartesian product of the basis states of the composite variables (so $\mathrm{tr}(\hat{\rho}_{AB}) = 1$).

Marginalization  Given a joint density matrix, we can recover the marginal 'reduced density matrix' for a subsystem of interest with the 'partial trace' operation. This operation is the quantum analogue of classical marginalization.
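The trace and positivity conditions above are easy to check numerically. Below is a minimal numpy sketch of Equation 1; the pure states and mixture weights are arbitrary illustrative values, not anything from the paper's experiments:

```python
import numpy as np

def density_matrix(states, probs):
    """Mix pure states |psi_i> into a density matrix per Eq. (1):
    rho = sum_i p_i |psi_i><psi_i|."""
    return sum(p * np.outer(psi, psi.conj()) for psi, p in zip(states, probs))

rng = np.random.default_rng(0)
# Two normalized 3-dimensional pure states with complex amplitudes.
states = []
for _ in range(2):
    psi = rng.normal(size=3) + 1j * rng.normal(size=3)
    states.append(psi / np.linalg.norm(psi))

rho = density_matrix(states, [0.3, 0.7])
assert np.allclose(rho, rho.conj().T)       # Hermitian
assert np.isclose(np.trace(rho).real, 1.0)  # normalization tr(rho) = 1
assert np.all(np.diag(rho).real >= -1e-12)  # diagonal entries are probabilities
```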
For example, the partial trace for a two-variable joint system $\hat{\rho}_{AB}$, where we trace over the second particle to obtain the state of the first particle, is:

$$\hat{\rho}_A = \mathrm{tr}_B(\hat{\rho}_{AB}) = \sum_j {}_B\langle j|\hat{\rho}_{AB}|j\rangle_B \qquad (2)$$

Finally, we discuss the quantum analogues of the sum rule and Bayes rule. Consider a prior $\vec{\pi} = P(X)$ and a likelihood $P(Y|X)$ represented by the column-stochastic matrix A. We can then ask two questions: what are $P(Y)$ and $P(X|y)$?

Sum Rule  The classical answer to the first question involves multiplying the likelihood with the prior and marginalizing out X, like so:

$$P(Y) = \sum_x P(Y|x)P(x) = A\vec{\pi} \qquad (3)$$

Srinivasan et al. [2018] show how we can construct a quantum circuit to perform the classical sum rule (illustrated in Figure 1a; see appendix for a note on interpreting quantum circuits). First, recall that all operations on quantum states must be represented by unitary matrices in order to preserve the 2-norm of the state. $\hat{\rho}_{env}$ is an environment 'particle' always prepared in the same state that will eventually encode $\hat{\rho}_Y$: it is initially a matrix with zeros everywhere except $\hat{\rho}_{1,1} = 1$. Then, if the prior $\vec{\pi}$ is encoded in a density matrix $\hat{\rho}_X$, and the likelihood table A is encoded in a higher-dimensional unitary matrix, we can replicate the classical sum rule. Letting the prior $\hat{\rho}_X$ be any density matrix and $\hat{U}_1$ be any unitary matrix generalizes the circuit to perform the 'quantum sum rule'.
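The partial trace in Equation 2 can be sketched in a few lines of numpy; the reshape/einsum layout below assumes the joint matrix is ordered as a Kronecker product of the two subsystems:

```python
import numpy as np

def partial_trace_B(rho_AB, dim_A, dim_B):
    """Trace out the second subsystem of a joint density matrix,
    per Eq. (2): rho_A = sum_j <j|_B rho_AB |j>_B."""
    rho = rho_AB.reshape(dim_A, dim_B, dim_A, dim_B)
    return np.einsum('ajbj->ab', rho)  # sum over the B-basis diagonal

# For a product state rho_AB = rho_A (x) rho_B, tracing out B recovers rho_A.
rho_A = np.diag([0.2, 0.8]).astype(complex)
rho_B = np.diag([0.5, 0.3, 0.2]).astype(complex)
rho_AB = np.kron(rho_A, rho_B)
assert np.allclose(partial_trace_B(rho_AB, 2, 3), rho_A)
```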
This circuit can be written as the following operation (the unitary matrix produces the joint distribution, the partial trace carries out the marginalization):

$$\hat{\rho}_Y = \mathrm{tr}_X\left(\hat{U}_1(\hat{\rho}_X \otimes \hat{\rho}_{env})\hat{U}_1^\dagger\right) \qquad (4)$$

Bayes Rule  Classically, we perform the Bayesian update as follows (where $\mathrm{diag}(A_{(y,:)})$ selects the row of matrix A corresponding to observation y and stacks it along a diagonal):

$$P(X|y) = \frac{P(y|X)P(X)}{\sum_x P(y|x)P(x)} = \frac{\mathrm{diag}(A_{(y,:)})\vec{\pi}}{\mathbf{1}^T\mathrm{diag}(A_{(y,:)})\vec{\pi}} \qquad (5)$$

Figure 1: Quantum-circuit analogues of conditioning in graphical models. (a) Quantum circuit to compute $P(Y)$; (b) quantum circuit to compute $P(X|y)$; (c) alternate circuit to compute $P(X|y)$.

The quantum circuit for the Bayesian update presented by Srinivasan et al. [2018] is shown in Figure 1b. It involves encoding the prior in $\hat{\rho}_X$ as before, and encoding the likelihood table A in a unitary matrix $\hat{U}_2$. Applying the unitary matrix $\hat{U}_2$ prepares the joint state $\hat{\rho}_{XY}$, and we apply a von Neumann projection operator (denoted $\hat{P}_y$) corresponding to the observation y (the D-shaped symbol in the circuit) to obtain the conditioned state $\hat{\rho}_{X|y}$ in the first particle. The projection operator selects the entries from the joint distribution $\hat{\rho}_{XY}$ that correspond to the actual observation y, and zeroes out the other entries, analogous to using an indicator vector to index into a joint probability table.
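A quick numerical sanity check of Equation 4: with any unitary $\hat{U}_1$ (here a random one obtained from a QR decomposition, purely for illustration), the output of the circuit remains a valid density matrix:

```python
import numpy as np

def quantum_sum_rule(rho_X, U, dim_env):
    """Eq. (4): rho_Y = tr_X( U (rho_X (x) rho_env) U^dagger ),
    with rho_env fixed to the state whose only nonzero entry is (1,1)."""
    dim_X = rho_X.shape[0]
    rho_env = np.zeros((dim_env, dim_env), dtype=complex)
    rho_env[0, 0] = 1.0
    joint = U @ np.kron(rho_X, rho_env) @ U.conj().T
    # trace out the first (X) subsystem
    joint = joint.reshape(dim_X, dim_env, dim_X, dim_env)
    return np.einsum('iaib->ab', joint)

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(6, 6)) + 1j * rng.normal(size=(6, 6)))
rho_X = np.diag([0.6, 0.4]).astype(complex)
rho_Y = quantum_sum_rule(rho_X, U, 3)
assert np.isclose(np.trace(rho_Y).real, 1.0)       # trace preserved
assert np.allclose(rho_Y, rho_Y.conj().T)          # Hermitian
assert np.all(np.linalg.eigvalsh(rho_Y) >= -1e-10) # positive semidefinite
```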
This operation can be written (the denominator renormalizes to recover a valid density matrix) as:

$$\hat{\rho}_{X|y} = \frac{\mathrm{tr}_{env}\left(P_y\hat{U}_2(\hat{\rho}_X \otimes \hat{\rho}_{env})\hat{U}_2^\dagger P_y^\dagger\right)}{\mathrm{tr}\left(\mathrm{tr}_{env}\left(P_y\hat{U}_2(\hat{\rho}_X \otimes \hat{\rho}_{env})\hat{U}_2^\dagger P_y^\dagger\right)\right)} \qquad (6)$$

However, there is an alternate quantum circuit that could implement Bayesian conditioning. Consider re-writing the classical Bayesian update as a linear update as follows:

$$P(X|y) = (A \cdot \mathrm{diag}(\vec{\pi}))^T(\mathrm{diag}(A\vec{\pi}))^{-1}\vec{e}_y \qquad (7)$$

where $(A \cdot \mathrm{diag}(\vec{\pi}))^T$ yields the joint probability table $P(X,Y)$, and $(\mathrm{diag}(A\vec{\pi}))^{-1}$ is a diagonal matrix with the inverse probabilities $\frac{1}{P(Y=y)}$ on the diagonal, serving to renormalize the columns of the joint probability table $P(X,Y)$. Thus, $(A \cdot \mathrm{diag}(\vec{\pi}))^T(\mathrm{diag}(A\vec{\pi}))^{-1}$ produces a column-stochastic matrix, and $\vec{e}_y$ is just an indicator vector that selects the column corresponding to the observation y. Then, just as the circuit in Figure 1a is the quantum generalization of Equation 3, we can use the quantum circuit shown in Figure 1c for this alternate Bayesian update. Here, $\hat{\rho}_y$ encodes the indicator vector corresponding to the observation y, and $\hat{U}_3^\pi$ is a unitary matrix constructed using the prior $\pi$ on X. Letting $\hat{U}_3^\pi$ be any unitary matrix constructed from some prior on X gives an alternative quantum Bayesian update.

These are two different ways of generalizing the classical Bayes rule within quantum graphical models. So which circuit should we use? One major disadvantage of the second approach is that we must construct different unitary matrices $\hat{U}_3^\pi$ for different priors on X. The first approach also explicitly involves measurement, which is nicely analogous to classical observation.
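The equivalence of Equations 5 and 7 is easy to verify numerically; a small sketch with an arbitrary random likelihood and prior:

```python
import numpy as np

rng = np.random.default_rng(2)
# Random column-stochastic likelihood A (rows index y, columns index x) and prior pi.
A = rng.random((4, 3))
A /= A.sum(axis=0, keepdims=True)
pi = rng.random(3)
pi /= pi.sum()

y = 2
# Eq. (5): standard Bayes rule using the row of A selected by y.
posterior_5 = (A[y, :] * pi) / (A[y, :] * pi).sum()
# Eq. (7): the same update as one linear map applied to an indicator vector e_y.
e_y = np.eye(4)[y]
posterior_7 = (A * pi).T @ np.diag(1.0 / (A @ pi)) @ e_y
assert np.allclose(posterior_5, posterior_7)
```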
As we will see in the next section, the two circuits are different ways of performing inference in Hilbert space, with the first approach being equivalent to Nadaraya-Watson kernel regression and the second approach being equivalent to the kernel Bayes rule for Hilbert space embeddings.

3 Translating to the language of Hilbert Space Embeddings
In the previous section, we generalized graphical models to quantum graphical models using the quantum view of probability. And since quantum states live in complex Hilbert spaces, inference in QGMs happens in Hilbert space. Here, we re-write the operations on QGMs in the language of Hilbert space embeddings, which should be more familiar to the statistical machine learning community.

3.1 Hilbert Space Embeddings
Previous work [Smola et al., 2007] has shown that we can embed probability distributions over a data domain X in a reproducing kernel Hilbert space (RKHS) $\mathcal{F}$ -- a Hilbert space of functions, with some kernel k. The feature map $\phi(x) = k(x,\cdot)$ maps data points to the RKHS, and the kernel function satisfies $k(x, x') = \langle\phi(x), \phi(x')\rangle_{\mathcal{F}}$. Additionally, the dot product in the Hilbert space satisfies the reproducing property:

$$\langle f(\cdot), k(x,\cdot)\rangle_{\mathcal{F}} = f(x), \quad \text{and} \quad \langle k(x,\cdot), k(x',\cdot)\rangle_{\mathcal{F}} = k(x, x') \qquad (8)$$

3.1.1 Mean Embeddings
The following equations describe how a distribution of a random variable X is embedded as a mean map [Smola et al., 2007], and how to empirically estimate the mean map from data points $\{x_1, \dots, x_n\}$ drawn i.i.d. from $P(X)$, respectively:

$$\mu_X := \mathbb{E}_X[\phi(X)], \qquad \hat{\mu}_X = \frac{1}{n}\sum_{i=1}^n \phi(x_i) \qquad (9)$$

Quantum Mean Maps  We still take the expectation of the features of the data, except we require that the feature maps $\phi(\cdot)$ produce valid density matrices representing pure states (i.e., rank 1). Consequently, quantum mean maps have the nice property of having probabilities along the diagonal. Note that these feature maps can be complex and infinite, and in the latter case, they map to density operators. For notational consistency, we require the feature maps to produce rank-1 vectorized density matrices (by vertically concatenating the columns of the matrix), and treat the quantum mean map as a vectorized density matrix $\vec{\mu}_X = \vec{\rho}_X$.

3.1.2 Cross-Covariance Operators
Cross-covariance operators can be used to embed joint distributions; for example, the joint distribution of random variables X and Y can be represented as a cross-covariance operator (see Song et al. [2013] for more details):

$$\mathcal{C}_{XY} := \mathbb{E}_{XY}[\phi(X) \otimes \phi(Y)] \qquad (10)$$

Quantum Cross-Covariance Operators  The quantum embedding of a joint distribution $P(X,Y)$ is a square $mn \times mn$ density matrix $\hat{\rho}_{XY}$ for constituent $m \times m$ embedding of a sample from $P(X)$ and $n \times n$ embedding of a sample from $P(Y)$. To obtain a quantum cross-covariance matrix $\mathcal{C}_{XY}$, we simply reshape $\hat{\rho}_{XY}$ to an $m^2 \times n^2$ matrix, which is also consistent with estimating it from data as the expectation of the outer product of feature maps $\phi(\cdot)$ that produce vectorized density matrices.

3.2 Quantum Sum Rule as Kernel Sum Rule
We now re-write the quantum sum rule for quantum graphical models from Equation 4 in the language of Hilbert space embeddings. Srinivasan et al.
[2018] showed that Equation 4 can be written as $\hat{\rho}_Y = \sum_i V_i\hat{U}W\hat{\rho}_XW^\dagger\hat{U}^\dagger V_i^\dagger$, where the matrices W and V tensor with an environment particle and carry out the partial trace, respectively. Observe that a quadratic matrix operation can be simplified to a linear operation, i.e., $\hat{U}\hat{\rho}\hat{U}^\dagger = \mathrm{reshape}((\hat{U}^* \otimes \hat{U})\vec{\rho})$, where $\vec{\rho}$ is the vectorized density matrix $\hat{\rho}$. Then:

$$\vec{\mu}_Y = \left(\sum_i (V_i\hat{U}W)^* \otimes (V_i\hat{U}W)\right)\vec{\mu}_X = A\vec{\mu}_X \qquad (11)$$

where $A = \sum_i (V_i\hat{U}W)^* \otimes (V_i\hat{U}W)$. We have re-written the complicated transition update as a simple linear operation, though A should have constraints to ensure the operation is valid according to quantum mechanics. Consider estimating A from data by solving a least squares problem: suppose we have data $(\Upsilon_X, \Phi_Y)$ where $\Phi \in \mathbb{R}^{d_1 \times n}$, $\Upsilon \in \mathbb{R}^{d_2 \times n}$ are matrices of n vectorized $d_1$-, $d_2$-dimensional density matrices, and n is the number of data points. Solving for A gives us $A = \Phi_Y\Upsilon_X^\dagger(\Upsilon_X\Upsilon_X^\dagger)^{-1}$. But $\Phi_Y\Upsilon_X^\dagger = n \cdot \mathcal{C}_{YX}$ where $\mathcal{C}_{YX} = \frac{1}{n}\sum_i^n \vec{\mu}_{Y_i} \otimes \vec{\mu}_{X_i}^\dagger$. Then $A = \mathcal{C}_{YX}\mathcal{C}_{XX}^{-1}$, allowing us to re-write Equation 11 as:

$$\vec{\mu}_Y = \mathcal{C}_{YX}\mathcal{C}_{XX}^{-1}\vec{\mu}_X \qquad (12)$$

But this is exactly the kernel sum rule from Song et al. [2013], with the conditional embedding operator $\mathcal{C}_{Y|X} = \mathcal{C}_{YX}\mathcal{C}_{XX}^{-1}$. Thus, when the feature maps produce valid (vectorized) rank-1 density matrices, the quantum sum rule is identical to the kernel sum rule. One thing to note is that solving for A using least squares needn't preserve the quantum-imposed constraints; so either the learning algorithm must enforce these constraints, or we must project $\vec{\mu}_Y$ back to a valid density matrix.

Finite Sample Kernel Estimate  We can straightforwardly adopt the kernelized version of the conditional embedding operator from HSEs [Song et al., 2013] ($\lambda$ is a regularization parameter):

$$\mathcal{C}_{Y|X} = \Phi(K_{xx} + \lambda I)^{-1}\Upsilon^\dagger \qquad (13)$$

where $\Phi = (\phi(y_1), \dots, \phi(y_n))$, $\Upsilon = (\phi(x_1), \dots, \phi(x_n))$, and $K_{xx} = \Upsilon^\dagger\Upsilon$, and these feature maps produce vectorized rank-1 density matrices. The data points in Hilbert space can be written as $\vec{\mu}_Y = \Phi\alpha_Y$ and $\vec{\mu}_X = \Upsilon\alpha_X$, where $\alpha \in \mathbb{R}^n$ are weights for the training data points, and the kernel quantum sum rule is simply:

$$\vec{\mu}_Y = \mathcal{C}_{Y|X}\vec{\mu}_X \;\Rightarrow\; \Phi\alpha_Y = \Phi(K_{xx} + \lambda I)^{-1}\Upsilon^\dagger\Upsilon\alpha_X \;\Rightarrow\; \alpha_Y = (K_{xx} + \lambda I)^{-1}K_{xx}\alpha_X \qquad (14)$$

3.3 Quantum Bayes Rule as Nadaraya-Watson Kernel Regression
Here, we re-write the Bayesian update for QGMs from Equation 6 in the language of HSEs. First, we modify the quantum circuit in Figure 1b to allow for measurement of a rank-1 density matrix $\hat{\rho}_y$ in any basis (see Appendix for details) to obtain the circuit shown in Figure 2, described by the equation:

$$\hat{\rho}_{X|y} \propto \mathrm{tr}_{env}\left((I \otimes \hat{u})P(I \otimes \hat{u}^\dagger)\hat{U}(\hat{\rho}_X \otimes \hat{\rho}_{env})\hat{U}^\dagger(I \otimes \hat{u}^\dagger)^\dagger P^\dagger(I \otimes \hat{u})^\dagger\right) \qquad (15)$$

where $\hat{u}$ changes the basis of the environment variable to one in which the rank-1 density matrix encoding the observation $\hat{\rho}_Y$ is diagonalized to $\Lambda$ -- a matrix with all zeros except $\Lambda_{1,1} = 1$.
The projection operator will be $P = (I \otimes \Lambda)$, which means the terms $(I \otimes \hat{u})P(I \otimes \hat{u}^\dagger) = (I \otimes \hat{u})(I \otimes \Lambda)(I \otimes \hat{u}^\dagger) = (I \otimes \hat{u}\Lambda\hat{u}^\dagger) = (I \otimes \hat{\rho}_y)$, allowing us to rewrite Equation 15 as:

$$\hat{\rho}_{X|y} \propto \mathrm{tr}_{env}\left((I \otimes \hat{\rho}_y)\hat{U}(\hat{\rho}_X \otimes \hat{\rho}_{env})\hat{U}^\dagger(I \otimes \hat{\rho}_y)^\dagger\right) \qquad (16)$$

Let us break this equation into two steps:

$$\hat{\rho}_{XY} = \hat{U}(\hat{\rho}_X \otimes \hat{\rho}_{env})\hat{U}^\dagger = \hat{U}W\hat{\rho}_XW^\dagger\hat{U}^\dagger, \qquad \hat{\rho}_{X|y} = \frac{\mathrm{tr}_{env}\left((I \otimes \hat{\rho}_y)\hat{\rho}_{XY}(I \otimes \hat{\rho}_y)^\dagger\right)}{\mathrm{tr}\left(\mathrm{tr}_{env}\left((I \otimes \hat{\rho}_y)\hat{\rho}_{XY}(I \otimes \hat{\rho}_y)^\dagger\right)\right)} \qquad (17)$$

Now, we re-write the first expression in the language of HSEs. The quadratic matrix operation can be re-written as a linear operation by vectorizing the density matrix as we did in Section 3.2: $\vec{\mu}_{XY} = ((\hat{U}W)^* \otimes (\hat{U}W))\vec{\mu}_X$. But for $\vec{\mu}_X \in \mathbb{R}^{n^2 \times 1}$, $W \in \mathbb{R}^{ns \times n}$, and $\hat{U} \in \mathbb{C}^{ns \times ns}$, this operation gives $\vec{\mu}_{XY} \in \mathbb{R}^{n^2s^2 \times 1}$, which we can reshape into an $n^2 \times s^2$ matrix $\mathcal{C}_{XY}^{\pi_X}$ (the superscript simply indicates the matrix was composed from a prior on X). We can then directly write $\mathcal{C}_{XY}^{\pi_X} = \mathcal{B} \times_3 \vec{\mu}_X$, where $\mathcal{B}$ is $((\hat{U}W)^* \otimes (\hat{U}W))$ reshaped into a three-mode tensor and $\times_3$ represents a tensor contraction along the third mode. But, just as we solved $A = \mathcal{C}_{YX}\mathcal{C}_{XX}^{-1}$ in Section 3.2, we can estimate $\mathcal{B} = \mathcal{C}_{(XY)X}\mathcal{C}_{XX}^{-1} = \mathcal{C}_{XY|X}$ as a matrix and reshape it into a 3-mode tensor, allowing us to re-write the first step in Equation 17 as:

$$\mathcal{C}_{XY}^{\pi_X} = \mathcal{C}_{(XY)X}\mathcal{C}_{XX}^{-1}\vec{\mu}_X = \mathcal{C}_{XY|X} \times_3 \vec{\mu}_X \qquad (18)$$

Now, to simplify the second step, observe that the numerator can be rewritten to get $\vec{\mu}_{X|y} \propto \mathcal{C}_{XY}^{\pi_X}(\hat{\rho}_y^T \otimes \hat{\rho}_y)\vec{t}$, where $\vec{t}$ is a vector of 1s and 0s that carries out the partial trace operation. But, for a rank-1 density matrix $\hat{\rho}_y$, this actually simplifies further:

$$\vec{\mu}_{X|y} \propto \mathcal{C}_{XY}^{\pi_X}\vec{\rho}_y = \left(\mathcal{C}_{XY|X} \times_3 \vec{\mu}_X\right)\vec{\rho}_y \qquad (19)$$

One way to renormalize $\vec{\mu}_{X|y}$ is to compute $\left(\mathcal{C}_{XY|X} \times_3 \vec{\mu}_X\right)\vec{\rho}_y$, reshape it back into a density matrix, and divide by its trace. Alternatively, we can rewrite this operation using a vectorized identity matrix $\vec{I}$ that carries out the full trace in the denominator to renormalize:

$$\vec{\mu}_{X|y} = \frac{\left(\mathcal{C}_{XY|X} \times_3 \vec{\mu}_X\right)\vec{\rho}_y}{\vec{I}^T\left(\mathcal{C}_{XY|X} \times_3 \vec{\mu}_X\right)\vec{\rho}_y} \qquad (20)$$

Finite Sample Kernel Estimate  We kernelize these operations as follows (where $\phi(y) = \vec{\rho}_y$):

$$\vec{\mu}_{X|y} = \frac{\Upsilon \cdot \mathrm{diag}(\alpha_X) \cdot \Phi^T\phi(y)}{\vec{I}^T\Upsilon \cdot \mathrm{diag}(\alpha_X) \cdot \Phi^T\phi(y)} = \frac{\sum_i \Upsilon_i(\alpha_X)_i k(y_i, y)}{\sum_j (\alpha_X)_j k(y_j, y)} = \Upsilon\alpha_Y^{(X)} \qquad (21)$$

where $(\alpha_Y^{(X)})_i = \frac{(\alpha_X)_i k(y_i, y)}{\sum_j (\alpha_X)_j k(y_j, y)}$, and $\vec{I}^T\Upsilon = \mathbf{1}^T$ since $\Upsilon$ contains vectorized density matrices, and $\vec{I}$ carries out the trace operation.

Figure 2: Quantum circuit to compute the posterior $P(X|y)$.

As it happens, this method of estimating the conditional embedding $\vec{\mu}_{X|y}$ is equivalent to performing Nadaraya-Watson kernel regression [Nadaraya, 1964, Watson, 1964] from the joint embedding to the kernel embedding. Note that this result only holds for kernels satisfying Equation 4.22 in Wasserman [2006]; importantly, the kernel function must only output positive numbers. One way to enforce this is by using a squared kernel; this is equivalent to a 2nd-order polynomial expansion of the features or computing the outer product of features.
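In weight space, the resulting update is just a kernel reweighting of the state weights. A minimal sketch of Equation 21 (the Gaussian base kernel and data here are illustrative; squaring the kernel keeps all evaluations positive, as required):

```python
import numpy as np

def nadaraya_watson_update(alpha_X, k_obs):
    """Eq. (21): reweight the state weights by kernel evaluations at the
    observation, then renormalize (Nadaraya-Watson kernel regression)."""
    w = alpha_X * k_obs
    return w / w.sum()

rng = np.random.default_rng(5)
ys = rng.normal(size=30)          # training observations
alpha_X = np.full(30, 1.0 / 30)   # uniform prior weights
y_obs = 0.1
k = np.exp(-(ys - y_obs) ** 2) ** 2  # squared Gaussian kernel against new y
alpha_post = nadaraya_watson_update(alpha_X, k)
assert np.isclose(alpha_post.sum(), 1.0)
# The training point nearest the observation gains weight.
assert alpha_post[np.argmin(np.abs(ys - y_obs))] > 1.0 / 30
```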
Our choice of feature map produces density matrices (as the outer product of features), so their inner product in Hilbert space is equivalent to computing the squared kernel, and this constraint is satisfied.

3.4 Quantum Bayes Rule as Kernel Bayes Rule
As we discussed at the end of Section 2.2, Figure 1c is an alternate way of generalizing Bayes rule for QGMs. But following the same approach of rewriting the quantum circuit in the language of Hilbert space embeddings as in Section 3.2, we get exactly the kernel Bayes rule [Song et al., 2013]:

$$\vec{\mu}_{X|y} = \mathcal{C}_{XY}^\pi(\mathcal{C}_{YY}^\pi)^{-1}\phi(y) \qquad (22)$$

What we have shown thus far  As promised, we see that the two different but valid ways of generalizing Bayes rule for QGMs determine whether we condition according to the kernel Bayes rule or to Nadaraya-Watson kernel regression. However, we stress that conditioning according to Nadaraya-Watson is computationally much easier; the kernel Bayes rule given by Song et al. [2013] using Gram matrices is written:

$$\hat{\mu}_{X|y} = \Upsilon DK_{yy}\left((DK_{yy})^2 + \lambda I\right)^{-1}DK_{:y} \qquad (23)$$

where $D = \mathrm{diag}\left((K_{xx} + \lambda I)^{-1}K_{xx}\alpha_X\right)$. Observe that this update requires squaring and inverting the Gram matrix $K_{yy}$ -- an expensive operation. By contrast, performing the Bayesian update using Nadaraya-Watson as per Equation 21 is straightforward. This is one of the key insights of this paper: Nadaraya-Watson kernel regression is an alternate, valid, but simpler way of generalizing Bayes rule to Hilbert space embeddings.
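The cost difference is visible even in a toy sketch: Equation 23 squares and inverts an $n \times n$ Gram matrix, while Equation 21 is an elementwise reweighting. The sketch below simplifies the smoothing matrix D to diag(α_X) purely for illustration (the paper's D additionally applies the kernel sum rule smoothing), and uses arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
ys = rng.normal(size=n)
K_yy = np.exp(-(ys[:, None] - ys[None, :]) ** 2) ** 2  # squared-kernel Gram matrix
k_y = np.exp(-(ys - 0.3) ** 2) ** 2                    # K_{:y} for a new observation
alpha_X = np.full(n, 1.0 / n)
lam = 1e-3

# Kernel Bayes rule weights, Eq. (23): requires an O(n^3) solve of (DK)^2 + lam*I.
D = np.diag(alpha_X)  # illustrative stand-in for D = diag((K_xx+lam I)^-1 K_xx alpha_X)
DK = D @ K_yy
beta_kbr = DK @ np.linalg.solve(DK @ DK + lam * np.eye(n), D @ k_y)

# Nadaraya-Watson weights, Eq. (21): elementwise reweighting, no matrix inversion.
w = alpha_X * k_y
beta_nw = w / w.sum()

assert beta_kbr.shape == beta_nw.shape == (n,)
```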
We note that interpreting operations on QGMs as inference in Hilbert space is a special case; if the feature maps don't produce density matrices, we can still perform inference in Hilbert space using the quantum/kernel sum rule and Nadaraya-Watson/kernel Bayes rule, but we lose the probabilistic interpretation of a quantum graphical model.

4 HSE-HQMMs
We now consider mapping data to vectorized density matrices and modeling the dynamics in Hilbert space using a specific quantum graphical model -- hidden quantum Markov models (HQMMs). The quantum circuit for HQMMs is shown in Figure 3 [Srinivasan et al., 2018].

Figure 3: Quantum circuit for the HSE-HQMM.

We use the outer product of random Fourier features (RFF) [Rahimi and Recht, 2008] (which produces a valid density matrix) to map data to a Hilbert space. $\hat{U}_1$ encodes transition dynamics, $\hat{U}_2$ encodes observation dynamics, and $\hat{\rho}_t$ is the density matrix after a transition update and conditioning on some observation. The transition and observation equations describing this circuit (with $\hat{U}_I = I \otimes \hat{u}^\dagger$) are:

$$\hat{\rho}'_t = \mathrm{tr}_{\hat{\rho}_{t-1}}\left(\hat{U}_1(\hat{\rho}_{t-1} \otimes \hat{\rho}_{X_t})\hat{U}_1^\dagger\right) \quad \text{and} \quad \hat{\rho}_t \propto \mathrm{tr}_{Y_t}\left(\hat{U}_IP_y\hat{U}_I^\dagger\hat{U}_2(\hat{\rho}'_t \otimes \hat{\rho}_{Y_t})\hat{U}_2^\dagger\hat{U}_IP_y^\dagger\hat{U}_I^\dagger\right) \qquad (24)$$

As we saw in the previous section, we can rewrite these in the language of Hilbert space embeddings. The kernelized versions of these operations (where $\Upsilon = (\phi(x_1), \dots, \phi(x_n))$) are (see appendix):

$$\vec{\mu}'_{x_t} = \left(\mathcal{C}_{x_tx_{t-1}}\mathcal{C}_{x_{t-1}x_{t-1}}^{-1}\right)\vec{\mu}_{x_{t-1}} \quad \text{and} \quad \alpha'_{x_t} = (K_{x_{t-1}x_{t-1}} + \lambda I)^{-1}K_{x_{t-1}x_t}\alpha_{x_{t-1}} \qquad (25)$$

$$\vec{\mu}_t = \frac{\left(\mathcal{C}_{x_ty_t|x_t} \times_3 \vec{\mu}'_{x_t}\right)\phi(y_t)}{\vec{I}^T\left(\mathcal{C}_{x_ty_t|x_t} \times_3 \vec{\mu}'_{x_t}\right)\phi(y_t)} \quad \text{and} \quad (\alpha_{x_t})_i = \frac{(\alpha'_{x_t})_i\,k(y_i, y)}{\sum_j (\alpha'_{x_t})_j\,k(y_j, y)} \qquad (26)$$

It is also possible to combine the operations, setting $\mathcal{C}_{x_ty_t|x_{t-1}} = \mathcal{C}_{x_ty_t|x_t}\mathcal{C}_{x_tx_{t-1}}\mathcal{C}_{x_{t-1}x_{t-1}}^{-1}$, to write our single update in Hilbert space:

$$\vec{\mu}_{x_t} = \frac{\left(\mathcal{C}_{x_ty_t|x_{t-1}} \times_3 \vec{\mu}_{x_{t-1}}\right)\phi(y_t)}{\vec{I}^T\left(\mathcal{C}_{x_ty_t|x_{t-1}} \times_3 \vec{\mu}_{x_{t-1}}\right)\phi(y_t)} \qquad (27)$$

Making Predictions  As discussed in Srinivasan et al. [2018], conditioning on some discrete-valued observation y in the quantum model produces an unnormalized density matrix whose trace is the probability of observing y. However, in the case of continuous-valued observations, we can go further and treat this trace as the unnormalized density of the observation $y_t$, i.e., $f_Y(y_t) \propto \vec{I}^T(\mathcal{C}_{x_ty_t|x_{t-1}} \times_3 \vec{\mu}_{t-1})\phi(y_t)$ -- the equivalent operation in the language of quantum circuits is the trace of the unnormalized $\hat{\rho}_t$ shown in Figure 3. A benefit of building this model using the quantum formalism is that we can immediately see that this trace is bounded and lies in [0, 1]. It is also straightforward to see that a tighter bound for the unnormalized densities is given by the largest and smallest eigenvalues of the reduced density matrix $\hat{\rho}_{Y_t} = \mathrm{tr}_{X_t}(\hat{\rho}_{X_tY_t})$, where $\hat{\rho}_{X_tY_t}$ is the joint density matrix after the application of $\hat{U}_2$.

To make a prediction, we sample from the convex hull of our training set, compute densities as described, and take the expectation to make a point prediction. This formalism is potentially powerful as it lets us maintain a whole distribution over the outputs (e.g., Figure 5), instead of just a point estimate for the next prediction as with LSTMs.
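A minimal sketch of the combined filtering update in Equation 27, with an arbitrary random tensor standing in for the learned $\mathcal{C}_{x_ty_t|x_{t-1}}$ (a real model would obtain it from the two-stage regression in Algorithm 1) and illustrative dimensions:

```python
import numpy as np

def hqmm_filter_update(C, mu_prev, phi_y):
    """Eq. (27): contract the 3-mode tensor C with the previous state along
    mode 3, apply the result to the observation features, and renormalize so
    the vectorized state has unit trace (I_vec . mu = 1)."""
    d = int(np.sqrt(mu_prev.size))
    I_vec = np.eye(d).flatten()              # vectorized identity = trace functional
    M = np.einsum('abc,c->ab', C, mu_prev)   # C x_3 mu_{t-1}
    mu = M @ phi_y
    return mu / (I_vec @ mu)

rng = np.random.default_rng(6)
d, k = 3, 5                                  # state is a d x d density matrix; k features
C = rng.random((d * d, k, d * d))            # illustrative stand-in for the learned tensor
mu_prev = (np.eye(d) / d).flatten()          # maximally mixed initial state
phi_y = rng.random(k)                        # features of the current observation
mu_t = hqmm_filter_update(C, mu_prev, phi_y)
assert np.isclose(np.trace(mu_t.reshape(d, d)), 1.0)
```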
A deeper investigation of the density estimation properties of our model would be an interesting direction for future work.

Learning HSE-HQMMs  We estimate model parameters using two-stage regression (2SR) [Hefny et al., 2015], and refine them using back-propagation through time (BPTT). With this approach, the learned parameters are not guaranteed to satisfy the quantum constraints, and we handle this by projecting the state back to a valid density matrix at each time step. Details are given in Algorithm 1.

Algorithm 1 Learning Algorithm using Two-Stage Regression for HSE-HQMMs
Input: Data as $y_{1:T} = y_1, \dots, y_T$
Output: Cross-covariance matrix $\mathcal{C}_{x_ty_t|x_{t-1}}$, which can be reshaped into a 3-mode tensor for prediction
1: Compute features of the past (h), future (f), and shifted future (s) from data (with window k):
   $h_t = h(y_{t-k:t-1})$, $f_t = f(y_{t:t+k})$, $s_t = f(y_{t+1:t+k+1})$
2: Project data and features of past, future, and shifted future into the RKHS using random Fourier features of the desired kernel (feature map $\phi(\cdot)$) to generate quantum systems:
   $|h\rangle_t \leftarrow \phi_h(h_t)$, $|f\rangle_t \leftarrow \phi_f(f_t)$, $|s\rangle_t \leftarrow \phi_f(s_t)$, $|y\rangle_t \leftarrow \phi_y(y_t)$
3: Construct density matrices in the RKHS and vectorize them:
   $\vec{\rho}_t^{(h)} = \mathrm{vec}(|h\rangle_t\langle h|_t)$, $\vec{\rho}_t^{(f)} = \mathrm{vec}(|f\rangle_t\langle f|_t)$, $\vec{\rho}_t^{(s)} = \mathrm{vec}(|s\rangle_t\langle s|_t)$, $\vec{\rho}_t^{(y)} = \mathrm{vec}(|y\rangle_t\langle y|_t)$
4: Compose matrices whose columns are the vectorized density matrices, for each available time-step (accounting for window size k), denoted $\Phi_y$, $\Upsilon_h$, $\Upsilon_f$, and $\Upsilon_s$ respectively.
5: Obtain the extended future via the tensor product $\vec{\rho}_t^{(s,y)} \leftarrow \vec{\rho}_t^{(s)} \otimes \vec{\rho}_t^{(y)}$ and collect it into the matrix $\Phi_{s,y}$.
6: Perform stage 1 regression:
   $\mathcal{C}_{f|h} \leftarrow \Upsilon_f\Upsilon_h^\dagger\left(\Upsilon_h\Upsilon_h^\dagger + \lambda I\right)^{-1}$
   $\mathcal{C}_{s,y|h} \leftarrow \Phi_{s,y}\Upsilon_h^\dagger\left(\Upsilon_h\Upsilon_h^\dagger + \lambda I\right)^{-1}$
7: Use the operators from stage 1 regression to obtain denoised predictive quantum states:
   $\tilde{\Upsilon}_{f|h} \leftarrow \mathcal{C}_{f|h}\Upsilon_h$
   $\tilde{\Phi}_{s,y|h} \leftarrow \mathcal{C}_{s,y|h}\Upsilon_h$
8: Perform stage 2 regression to obtain model parameters:
   $\mathcal{C}_{x_ty_t|x_{t-1}} \leftarrow \tilde{\Phi}_{s,y|h}\tilde{\Upsilon}_{f|h}^\dagger\left(\tilde{\Upsilon}_{f|h}\tilde{\Upsilon}_{f|h}^\dagger + \lambda I\right)^{-1}$

5 Comparison with Previous Work
5.1 HQMMs
Srinivasan et al. [2018] present a maximum-likelihood learning algorithm to estimate the parameters of an HQMM from data. However, it is very limited in its scope; the algorithm is slow and doesn't scale to large datasets. In this paper, we leverage connections to HSEs, kernel methods, and RFFs to achieve a more practical and scalable learning algorithm for these models. However, one difference to note is that the algorithm presented by Srinivasan et al. [2018] guaranteed that the learned parameters would produce valid quantum operators, whereas our algorithm only approximately produces valid quantum operators; we need to project the updated state back to the nearest quantum state to ensure that we are tracking a valid quantum system.

5.2 PSRNNs
Predictive State Recurrent Neural Networks (PSRNNs) [Downey et al., 2017] are a recent state-of-the-art model developed by embedding a Predictive State Representation (PSR) into an RKHS. The PSRNN update equation is:

$$\vec{\omega}_t = \frac{W \times_3 \vec{\omega}_{t-1} \times_2 \phi(y_t)}{\|W \times_3 \vec{\omega}_{t-1} \times_2 \phi(y_t)\|_F} \qquad (28)$$

where W is a three-mode tensor corresponding to the cross-covariance between observations and the state at time t conditioned on the state at time t-1, and $\omega$ is a factorization of a p.s.d. state matrix $\mu_t = \omega\omega^T$ (so renormalizing $\omega$ by the Frobenius norm is equivalent to renormalizing $\mu$ by its trace). There is a clear connection between PSRNNs and HSE-HQMMs; this matrix $\mu_t$ is what we vectorize to use as our state $\vec{\mu}_t$ in HSE-HQMMs, and both HSE-HQMMs and PSRNNs are parameterized (in the primal space using RFFs) in terms of a three-mode tensor (W for PSRNNs and $\mathcal{C}_{x_ty_t|x_{t-1}}$ for HSE-HQMMs).
We also note that while PSRNNs modified the kernel Bayes rule (from Equation 22) heuristically, we have shown that this modification can be interpreted as a generalization of Bayes rule for QGMs or as Nadaraya-Watson kernel regression. One key difference between these approaches is that we directly use states in Hilbert space to estimate the probability density of observations; in other words, HSE-HQMMs are a generative model. By contrast, PSRNNs are a discriminative model that relies on an additional ad hoc mapping from states to observations.

6 Experiments
We use the following datasets in our experiments¹:

• Penn Tree Bank (PTB) [Marcus et al., 1993]. We train a character-prediction model with a train/test split of 120780/124774 characters due to hardware limitations.
• Swimmer. Simulated swimmer robot from OpenAI Gym². We collect 25 trajectories from a robot that is trained to swim forward (via the cross-entropy method with a linear policy), with a 20/5 train/test split. There are 5 features at each time step: the angle of the robot's nose, together with the 2D angles of each of its joints.
• Mocap. Human motion capture dataset. We collect 48 skeletal tracks from three human subjects with a 40/8 train/test split. There are 22 total features at each time step: the 3D positions of the skeletal parts (e.g., upper back, thorax, clavicle).

We compare the performance of three models: HSE-HQMMs, PSRNNs, and LSTMs. We initialize PSRNNs and HSE-HQMMs using two-stage regression (2SR) [Downey et al., 2017] and LSTMs using Xavier initialization, and refine all three models using back-propagation through time (BPTT). We optimize and evaluate all models on Swimmer and Mocap with respect to the mean squared error (MSE) of 10-step predictions, as is conventional in the robotics community.
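For concreteness, the 10-step MSE evaluation can be sketched model-agnostically as follows; the `initial_state`/`filter`/`predict` interface is a hypothetical stand-in for illustration, not an API from the paper.

```python
import numpy as np

def evaluate_k_step_mse(model, test_obs, k=10):
    """Recursively filter over a test sequence; at each step, use the
    current state to predict the observation k steps ahead, and report
    the mean squared error against the true observation.

    `model` is a hypothetical interface exposing:
      - model.initial_state()    -> starting state
      - model.filter(state, y)   -> updated state after observing y
      - model.predict(state, k)  -> predicted observation k steps ahead
    """
    state = model.initial_state()
    sq_errors = []
    for t in range(len(test_obs) - k):
        state = model.filter(state, test_obs[t])   # recursive filtering
        y_hat = model.predict(state, k)            # k-step-ahead prediction
        sq_errors.append((y_hat - test_obs[t + k]) ** 2)
    return float(np.mean(sq_errors))
```

Any of the three models can be wrapped in such an interface; only the state update and the predictor differ between them.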
Concretely, to evaluate a model we perform recursive filtering on the test set to produce states, then use these states to make predictions about observations 10 steps in the future. We optimize all models on PTB with respect to perplexity (cross entropy) using 1-step predictions, as is conventional in the NLP community. As we can see in Figure 4, HSE-HQMMs outperform both PSRNNs and LSTMs on the Swimmer dataset, and achieve performance comparable to the best alternative on Mocap and PTB. Hyperparameters and other experimental details can be found in Appendix E.

¹Code will be made available at https://github.com/cmdowney/hsehqmm
²https://gym.openai.com/

Figure 4: Performance of HSE-HQMM, PSRNN, and LSTM on Mocap, Swimmer, and PTB

Visualizing Probability Densities   As mentioned previously, HSE-HQMMs can maintain a probability density function over future observations, and we visualize this for a model trained on the Mocap dataset in Figure 5. We take the 22-dimensional joint density and marginalize it to produce three marginal distributions, each over a single feature. We plot the resulting marginal distributions over time using a heatmap, and superimpose the ground truth and model predictions. We observe that BPTT (second row) improves the marginal distribution. Another interesting observation, from the last ≈30 timesteps of the marginal distribution in the top-left image, is that our model is able to produce a bimodal distribution with probability mass at both $y_i = 1.5$ and $y_i = 0.5$, without making any parametric assumptions. This kind of information is difficult to obtain using a discriminative model such as an LSTM or a PSRNN. Additional heatmaps can be found in Appendix D.

Figure 5: Heatmap visualizing the probability densities generated by our HSE-HQMM model. Red indicates high probability, blue indicates low probability; the x-axis corresponds to time and the y-axis to the feature value.
Each column corresponds to the predicted marginal distribution of a single feature changing with time. The first row shows the probability distribution after 2SR initialization; the second row shows the probability distribution after the model in row 1 has been refined via BPTT.

7 Conclusion and Future Work
We explored the connections between QGMs and HSEs, and showed that the sum rule and Bayes rule in QGMs are equivalent to the kernel sum rule and a special case of Nadaraya-Watson kernel regression, respectively. We proposed HSE-HQMMs to model dynamics, and showed experimentally that these models are competitive with LSTMs and PSRNNs at making point predictions, while also providing a nonparametric method for maintaining a probability distribution over continuous-valued features. Looking forward, we note that our experiments only consider real kernels/features, so we are not utilizing the full complex Hilbert space; it would be interesting to investigate whether incorporating complex numbers improves our model. Additionally, because we estimate parameters using least squares, the parameters only approximately adhere to the quantum constraints. The final model also bears a strong resemblance to PSRNNs [Downey et al., 2017]. It would be interesting to investigate both what happens if we are stricter about enforcing the quantum constraints, and what happens if we give the model greater freedom to drift from them. Finally, the density estimation properties of the model are also an avenue for future exploration.

References
B. Boots, A. Gretton, and G. J. Gordon. Hilbert space embeddings of PSRs. In NIPS Workshop on Spectral Algorithms for Latent Variable Models, 2012.

B. Boots, G. J. Gordon, and A. Gretton. Hilbert space embeddings of predictive state representations. CoRR, abs/1309.6819, 2013. URL http://arxiv.org/abs/1309.6819.

C. Downey, A. Hefny, B. Li, B. Boots, and G. J. Gordon.
Predictive state recurrent neural networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2017.

A. Hefny, C. Downey, and G. J. Gordon. Supervised learning for dynamical system learning. In Advances in Neural Information Processing Systems, pages 1963–1971, 2015.

M. Leifer and D. Poulin. Quantum graphical models and belief propagation. Ann. Phys., 323:1899, 2008.

M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

A. Monras, A. Beige, and K. Wiesner. Hidden quantum Markov models and non-adaptive read-out of many-body states. arXiv preprint arXiv:1002.2337, 2010.

E. A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

M. A. Nielsen and I. Chuang. Quantum Computation and Quantum Information, 2002.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

M. Schuld and N. Killoran. Quantum machine learning in feature Hilbert spaces. arXiv preprint arXiv:1803.07128, 2018.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.

L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968. ACM, 2009.

L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.

S. Srinivasan, G. J. Gordon, and B. Boots. Learning hidden quantum Markov models. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 2018.

L. Wasserman. All of Nonparametric Statistics (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387251456.

G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.

C.-H. Yeang. A probabilistic graphical model of quantum systems. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on, pages 155–162. IEEE, 2010.