{"title": "Learning HMMs with Nonparametric Emissions via Spectral Decompositions of Continuous Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 2865, "page_last": 2873, "abstract": "Recently, there has been a surge of interest in using spectral methods for estimating latent variable models. However, it is usually assumed that the distribution of the observations conditioned on the latent variables is either discrete or belongs to a parametric family. In this paper, we study the estimation of an $m$-state hidden Markov model (HMM) with only smoothness assumptions, such as H\\\"olderian conditions, on the emission densities. By leveraging some recent advances in continuous linear algebra and numerical analysis, we develop a computationally efficient spectral algorithm for learning nonparametric HMMs. Our technique is based on computing an SVD on nonparametric estimates of density functions by viewing them as \\emph{continuous matrices}. We derive sample complexity bounds via concentration results for nonparametric density estimation and novel perturbation theory results for continuous matrices. We implement our method using Chebyshev polynomial approximations. Our method is competitive with other baselines on synthetic and real problems and is also very computationally efficient.", "full_text": "Learning HMMs with Nonparametric Emissions via\n\nSpectral Decompositions of Continuous Matrices\n\nKirthevasan Kandasamy\u21e4\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nkandasamy@cs.cmu.edu\n\nalshedivat@cs.cmu.edu\n\nMaruan Al-Shedivat\u21e4\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nEric P. Xing\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nepxing@cs.cmu.edu\n\nAbstract\n\nRecently, there has been a surge of interest in using spectral methods for estimating\nlatent variable models. 
However, it is usually assumed that the distribution of the observations conditioned on the latent variables is either discrete or belongs to a parametric family. In this paper, we study the estimation of an m-state hidden Markov model (HMM) with only smoothness assumptions, such as Hölderian conditions, on the emission densities. By leveraging some recent advances in continuous linear algebra and numerical analysis, we develop a computationally efficient spectral algorithm for learning nonparametric HMMs. Our technique is based on computing an SVD on nonparametric estimates of density functions by viewing them as continuous matrices. We derive sample complexity bounds via concentration results for nonparametric density estimation and novel perturbation theory results for continuous matrices. We implement our method using Chebyshev polynomial approximations. Our method is competitive with other baselines on synthetic and real problems and is also very computationally efficient.

1 Introduction

Hidden Markov models (HMMs) [1] are one of the most popular statistical models for analyzing time series data in various application domains such as speech recognition, medicine, and meteorology. In an HMM, a discrete hidden state undergoes Markovian transitions from one of m possible states to another at each time step. If the hidden state at time t is ht, we observe a random variable xt ∈ X drawn from an emission distribution, Oj = P(xt | ht = j). In its most basic form, X is a discrete set and the Oj are discrete distributions. When dealing with continuous observations, it is conventional to assume that the emissions Oj belong to a parametric class of distributions, such as Gaussian.

Recently, spectral methods for estimating parametric latent variable models have gained immense popularity as a viable alternative to the Expectation Maximisation (EM) procedure [2-4].
At a high level, these methods estimate higher order moments from the data and recover the parameters via a series of matrix operations such as singular value decompositions, matrix multiplications and pseudo-inverses of the moments. In the case of discrete HMMs [2], these moments correspond exactly to the joint probabilities of the observations in the sequence.

Assuming parametric forms for the emission densities is often too restrictive since real world distributions can be arbitrary. Parametric models may introduce incongruous biases that cannot be reduced even with large datasets. To address this problem, we study nonparametric HMMs, assuming only some mild smoothness conditions on the emission densities, and design a spectral algorithm for this setting. Our method leverages some recent advances in continuous linear algebra [5, 6] which view two-dimensional functions as continuous analogues of matrices. Chebyshev polynomial approximations enable efficient computation of algebraic operations on these continuous objects [7, 8]. Using these ideas, we extend existing spectral methods for discrete HMMs to the continuous nonparametric setting. Our main contributions are:

*Joint lead authors.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1. We derive a spectral learning algorithm for HMMs with nonparametric emission densities. While the algorithm is similar to previous spectral methods for estimating models with a finite number of parameters, many of the ideas used to generalise it to the nonparametric setting are novel and, to the best of our knowledge, have not been used before in the machine learning literature.

2. We establish sample complexity bounds for our method. For this, we derive concentration results for nonparametric density estimation and novel perturbation theory results for the aforementioned continuous matrices.
The perturbation results are new and might be of independent interest.

3. We implement our algorithm by approximating the density estimates via Chebyshev polynomials, which enables efficient computation of many of the continuous matrix operations. Our method outperforms natural competitors in this setting on synthetic and real data and is computationally more efficient than most of them. Our Matlab code is available at github.com/alshedivat/nphmm.

While we focus on HMMs in this exposition, we believe that the ideas presented in this paper can be easily generalised to estimating other latent variable models and predictive state representations [9] with nonparametric observations using approaches developed by Anandkumar et al. [3].

Related Work: Parametric HMMs are usually estimated using the maximum likelihood principle via EM techniques [10] such as the Baum-Welch procedure [11]. However, EM is a local search technique, and optimization of the likelihood may be difficult. Hence, recent work on spectral methods has gained appeal. Our work builds on Hsu et al. [2] who showed that discrete HMMs can be learned efficiently under certain conditions. The key idea is that any HMM can be completely characterised in terms of quantities that depend entirely on the observations, called the observable representation, which can be estimated from data. Siddiqi et al. [4] show that the same algorithm works under slightly more general assumptions. Anandkumar et al. [3] proposed a spectral algorithm for estimating more general latent variable models with parametric observations via a moment matching technique.

That said, we are aware of little work on estimating latent variable models, including HMMs, when the observations are nonparametric. A commonly used heuristic is the nonparametric EM [12], which lacks theoretical underpinnings.
This should not be surprising because EM, as a maximum likelihood procedure, is degenerate for most nonparametric problems [13]. Only recently have De Castro et al. [14] provided a minimax-type result for the nonparametric setting. Siddiqi et al. [4] proposed a heuristic based on kernel smoothing to modify the discrete algorithm for continuous observations. Further, their procedure cannot be used to recover the joint or conditional probabilities of a sequence, which would be needed to compute probabilities of events and for other inference tasks. Song et al. [15, 16] developed an RKHS-based procedure for estimating the Hilbert space embedding of an HMM. While they provide theoretical guarantees, their bounds are in terms of the RKHS distance between the true and estimated embeddings. This metric depends on the choice of the kernel, and it is not clear how it translates to a suitable distance measure on the observation space such as an L1 or L2 distance. While their method can be used for prediction and pairwise testing, it cannot recover the joint and conditional densities. On the contrary, our model provides guarantees in terms of the more interpretable total variation distance and is able to recover the joint and conditional probabilities.

2 A Pint-sized Review of Continuous Linear Algebra

We begin with a pint-sized review of continuous linear algebra, which treats functions as continuous analogues of matrices. Appendix A contains a quart-sized review. Both sections are based on [5, 6]. While these objects can be viewed as operators on Hilbert spaces, which have been studied extensively over the years, the above line of work simplified and specialised the ideas to functions.

A matrix F ∈ R^{m×n} is an m×n array of numbers where F(i, j) denotes the entry in row i, column j. m or n could be (countably) infinite.
A column qmatrix (quasi-matrix) Q ∈ R^{[a,b]×m} is a collection of m functions defined on [a, b] where the row index is continuous and the column index is discrete. Writing Q = [q1, . . . , qm] where qj : [a, b] → R is the jth function, Q(y, j) = qj(y) denotes the value of the jth function at y ∈ [a, b]. Q^T ∈ R^{m×[a,b]} denotes a row qmatrix with Q^T(j, y) = Q(y, j). A cmatrix (continuous-matrix) C ∈ R^{[a,b]×[c,d]} is a two-dimensional function where both row and column indices are continuous and C(y, x) is the value of the function at (y, x) ∈ [a, b] × [c, d]. C^T ∈ R^{[c,d]×[a,b]} denotes its transpose with C^T(x, y) = C(y, x). Qmatrices and cmatrices permit all matrix multiplications with suitably defined inner products. For example, if R ∈ R^{[c,d]×m} and C ∈ R^{[a,b]×[c,d]}, then CR = T ∈ R^{[a,b]×m} where T(y, j) = ∫_c^d C(y, s) R(s, j) ds.

A cmatrix has a singular value decomposition (SVD). If C ∈ R^{[a,b]×[c,d]}, it decomposes as an infinite sum, C(y, x) = Σ_{j=1}^∞ σj uj(y) vj(x), that converges in L2. Here σ1 ≥ σ2 ≥ · · · ≥ 0 are the singular values of C. {uj}_{j≥1} and {vj}_{j≥1} are functions that form orthonormal bases for L2([a, b]) and L2([c, d]), respectively. We can write the SVD as C = U Σ V^T by writing the singular vectors as infinite qmatrices U = [u1, u2, . . . ], V = [v1, v2, . . . ], and Σ = diag(σ1, σ2, . . . ). If only the first m < ∞ singular values are nonzero, we say that C is of rank m. The SVD of a qmatrix Q ∈ R^{[a,b]×m} is Q = U Σ V^T where U ∈ R^{[a,b]×m} and V ∈ R^{m×m} have orthonormal columns and Σ = diag(σ1, . . . , σm) with σ1 ≥ · · · ≥ σm ≥ 0. The rank of a column qmatrix is the number of linearly independent columns (i.e. functions) and is equal to the number of nonzero singular values. Finally, as for finite matrices, the pseudo-inverse of the cmatrix C is C† = V Σ^{-1} U^T with Σ^{-1} = diag(1/σ1, 1/σ2, . . . ).
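The decomposition and pseudo-inverse above can be made concrete numerically. The following is a minimal numpy sketch (not the paper's Chebfun/Matlab implementation): a rank-3 cmatrix C(y, x) = Σ_j s_j u_j(y) v_j(x) on [0, 1] × [0, 1] is sampled on a uniform grid; its numerical rank and pseudo-inverse then come from an ordinary SVD. The grid size and the example functions are arbitrary choices for illustration.

```python
import numpy as np

# Discretise a rank-3 "cmatrix" C(y, x) = sum_j s_j u_j(y) v_j(x) on a grid.
n = 200
grid = np.linspace(0.0, 1.0, n)
w = 1.0 / n  # quadrature weight approximating the L2 inner product

u = [np.sin(np.pi * grid), np.sin(2 * np.pi * grid), np.sin(3 * np.pi * grid)]
v = [np.cos(np.pi * grid), np.cos(2 * np.pi * grid), np.cos(3 * np.pi * grid)]
s = [1.0, 0.5, 0.25]
C = sum(sj * np.outer(uj, vj) for sj, uj, vj in zip(s, u, v))

# Discrete SVD; rescaling by the quadrature weight approximates the
# singular values of the continuous object.
U, sig, Vt = np.linalg.svd(C)
sigma = sig * w
rank = int(np.sum(sigma > 1e-8 * sigma[0]))  # numerical rank: 3

# Truncated pseudo-inverse  C† = V Σ^{-1} U^T.
C_pinv = (Vt[:rank].T / sig[:rank]) @ U[:, :rank].T
```

Chebfun performs the same computations with adaptive Chebyshev approximations instead of a fixed grid, so the truncation happens at machine precision rather than at a hand-picked resolution.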
The pseudo-inverse of a qmatrix is defined similarly.

3 Nonparametric HMMs and the Observable Representation

Notation: Throughout this manuscript, we will use P to denote probabilities of events while p will denote probability density functions (pdf). An HMM characterises a probability distribution over a sequence of hidden states {ht}_{t≥0} and observations {xt}_{t≥0}. At a given time step, the HMM can be in one of m hidden states, i.e. ht ∈ [m] = {1, . . . , m}, and the observation is in some bounded continuous domain X. Without loss of generality, we take² X = [0, 1]. The nonparametric HMM will be completely characterised by the initial state distribution π ∈ R^m, the state transition matrix T ∈ R^{m×m} and the emission densities Oj : X → R+, j ∈ [m]. πi = P(h1 = i) is the probability that the HMM would be in state i at the first time step. The element T(i, j) = P(h_{t+1} = i | h_t = j) of T gives the probability that a hidden state transitions from state j to state i. The emission function, Oj : X → R+, describes the pdf of the observation conditioned on the hidden state j, i.e. Oj(s) = p(xt = s | ht = j). Note that we have Oj(x) > 0 for all x and ∫ Oj(·) = 1 for all j ∈ [m]. In this exposition, we denote the emission densities by the qmatrix O = [O1, . . . , Om] ∈ R^{[0,1]×m}. In addition, let Õ(x) = diag(O1(x), . . . , Om(x)), and A(x) = T Õ(x). Let x1:t = {x1, . . . , xt} be an ordered sequence and xt:1 = {xt, . . . , x1} denote its reverse. For brevity, we will overload notation for A for sequences and write A(xt:1) = A(xt) A(x_{t-1}) . . . A(x1). It is well known [2, 17] that the joint probability density of the sequence x1:t can be computed via p(x1:t) = 1_m^T A(xt:1) π.

Key structural assumption: Previous work on estimating HMMs with continuous observations typically assumed that the emissions, Oj, take a parametric form, e.g. Gaussian. Unlike them, we only make mild nonparametric smoothness assumptions on Oj.
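The forward computation p(x1:t) = 1_m^T A(xt:1) π above can be sketched in a few lines. The toy 2-state HMM below (π, T and the Beta-shaped emission densities on [0, 1]) is an arbitrary illustration, not the paper's experimental setup:

```python
import numpy as np

# Toy forward computation of p(x_{1:t}) = 1_m^T A(x_t) ... A(x_1) pi,
# with A(x) = T diag(O_1(x), ..., O_m(x)).
pi = np.array([0.6, 0.4])
T = np.array([[0.7, 0.2],
              [0.3, 0.8]])  # T[i, j] = P(h_{t+1} = i | h_t = j); columns sum to 1

def O(x):
    """Emission pdfs at x: Beta(2, 5) and Beta(5, 2), written out explicitly."""
    return np.array([30.0 * x * (1 - x) ** 4, 30.0 * x ** 4 * (1 - x)])

def joint_density(xs):
    """p(x_1, ..., x_t) via the forward recursion b <- A(x) b, starting at pi."""
    b = pi.copy()
    for x in xs:            # apply A(x_1) first, A(x_t) last
        b = T @ (O(x) * b)  # O(x) * b == diag(O(x)) @ b
    return float(b.sum())   # left-multiply by 1_m^T

p = joint_density([0.3, 0.5, 0.8])
```

The observable representation developed next computes the same quantity without ever knowing π, T or O.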
As we will see, to estimate the HMM well in this problem we will need to estimate entire pdfs well. For this reason, the nonparametric setting is significantly more difficult than its parametric counterpart, as the latter requires estimating only a finite number of parameters. When compared to the previous literature, this is the crucial distinction and the main challenge in this work.

Observable Representation: The observable representation is a description of an HMM in terms of quantities that depend on the observations [17]. This representation is useful for two reasons: (i) it depends only on the observations and can be directly estimated from the data; (ii) it can be used to compute joint and conditional probabilities of sequences even without the knowledge of T and O and therefore can be used for inference and prediction. First, we define the joint densities P1, P21, P321:

P1(t) = p(x1 = t), P21(s, t) = p(x2 = s, x1 = t), P321(r, s, t) = p(x3 = r, x2 = s, x1 = t),

where xi, i = 1, 2, 3 denotes the observation at time i. Denote P3x1(r, t) = P321(r, x, t) for all x. We will find it useful to view both P21, P3x1 ∈ R^{[0,1]×[0,1]} as cmatrices. We will also need an additional qmatrix U ∈ R^{[0,1]×m} such that U^T O ∈ R^{m×m} is invertible. Given one such U, the observable representation of an HMM is described by the parameters b1, b∞ ∈ R^m and B : [0, 1] → R^{m×m},

b1 = U^T P1,    b∞ = (P21^T U)† P1,    B(x) = (U^T P3x1)(U^T P21)†.    (1)

As before, for a sequence xt:1 = {xt, . . . , x1}, we define B(xt:1) = B(xt) B(x_{t-1}) . . . B(x1). The following lemma shows that the first m left singular vectors of P21 are a natural choice for U.

Lemma 1. Let π > 0, T and O be of rank m and U be the qmatrix composed of the first m left singular vectors of P21. Then U^T O is invertible.

² We discuss the case of higher dimensions in Section 7.

To compute the joint and conditional probabilities using the observable representation, we maintain an internal state, bt, which is updated as we see more observations. The internal state at time t is

bt = B(x_{t-1:1}) b1 / (b∞^T B(x_{t-1:1}) b1).    (2)

This definition of bt is consistent with b1. The following lemma establishes the relationship of the observable representation and the internal states to the HMM parameters and probabilities.

Lemma 2 (Properties of the Observable Representation). Let rank(T) = rank(O) = m and U^T O be invertible. Let p(x1:t) denote the joint density of a sequence x1:t and p(x_{t+1:t+t'} | x1:t) denote the conditional density of x_{t+1:t+t'} given x1:t in a sequence x_{1:t+t'}. Then the following are true.
1. b1 = U^T O π.
2. b∞^T = 1_m^T (U^T O)^{-1}.
3. B(x) = (U^T O) A(x) (U^T O)^{-1} for all x ∈ [0, 1].
4. b_{t+1} = B(xt) bt / (b∞^T B(xt) bt).
5. p(x1:t) = b∞^T B(xt:1) b1.
6. p(x_{t+t':t+1} | x1:t) = b∞^T B(x_{t+t':t+1}) bt.

The last two claims of Lemma 2 show that we can use the observable representation for computing the joint and conditional densities. The proofs of Lemmas 1 and 2 are similar to the discrete case and mimic Lemmas 2, 3 & 4 of Hsu et al. [2].

4 Spectral Learning of HMMs with Nonparametric Emissions

The high level idea of our algorithm, NP-HMM-SPEC, is as follows. First we will obtain density estimates for P1, P21, P321, which will then be used to recover the observable representation b1, b∞, B by plugging in the expressions in (1). Lemma 2 then gives us a way to estimate the joint and conditional probability densities. For now, we will assume that we have N i.i.d. sequences of triples {X^(j)}_{j=1}^N where X^(j) = (X1^(j), X2^(j), X3^(j)) are the observations at the first three time steps. We describe learning from longer sequences in Section 4.3.

4.1 Kernel Density Estimation

The first step is the estimation of the joint probabilities which requires a nonparametric density estimate. While there are several techniques [18], we use kernel density estimation (KDE) since it is easy to analyse and works well in practice. The KDEs for P1, P21, and P321 take the form:

P̂1(t) = (1/N) Σ_{j=1}^N (1/h1) K((t - X1^(j))/h1),
P̂21(s, t) = (1/N) Σ_{j=1}^N (1/h21²) K((s - X2^(j))/h21) K((t - X1^(j))/h21),
P̂321(r, s, t) = (1/N) Σ_{j=1}^N (1/h321³) K((r - X3^(j))/h321) K((s - X2^(j))/h321) K((t - X1^(j))/h321).    (3)

Here K : [0, 1] → R is a symmetric function called a smoothing kernel and satisfies (at the very least) ∫_0^1 K(s) ds = 1 and ∫_0^1 s K(s) ds = 0. The parameters h1, h21, h321 are the bandwidths, and are typically decreasing with N. In practice they are usually chosen via cross-validation.

4.2 The Spectral Algorithm

Algorithm 1 NP-HMM-SPEC
Input: Data {X^(j) = (X1^(j), X2^(j), X3^(j))}_{j=1}^N, number of states m.
• Obtain estimates P̂1, P̂21, P̂321 for P1, P21, P321 via kernel density estimation (3).
• Compute the cmatrix SVD of P̂21. Let Û ∈ R^{[0,1]×m} be the first m left singular vectors of P̂21.
• Compute the parameters of the observable representation. Note that B̂ is an R^{m×m}-valued function.

b̂1 = Û^T P̂1,    b̂∞ = (P̂21^T Û)† P̂1,    B̂(x) = (Û^T P̂3x1)(Û^T P̂21)†.
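To make the steps of Algorithm 1 concrete, here is a heavily simplified numpy sketch: the densities live on a uniform grid as bin probabilities (instead of Chebyshev approximations), and a plain Gaussian kernel with one fixed bandwidth h stands in for the order-β kernel and the cross-validated bandwidths. With that discretisation the recipe reduces to the discrete observable-representation computations; everything here is illustrative, not the paper's Chebfun-based implementation.

```python
import numpy as np

def kde_weights(points, grid, h):
    """Rows of K((g - X_i)/h)/h over grid points g, one row per sample."""
    z = (grid[None, :] - points[:, None]) / h
    return np.exp(-0.5 * z * z) / (h * np.sqrt(2.0 * np.pi))

def np_hmm_spec(X, m, grid, h=0.08):
    """X: (N, 3) array of observation triples; m: number of hidden states."""
    N = X.shape[0]
    w = grid[1] - grid[0]
    # Bin-probability versions of the KDE estimates (3).
    K1 = kde_weights(X[:, 0], grid, h) * w
    K2 = kde_weights(X[:, 1], grid, h) * w
    K3 = kde_weights(X[:, 2], grid, h) * w
    P1 = K1.mean(axis=0)
    P21 = K2.T @ K1 / N
    P321 = np.einsum('nr,ns,nt->rst', K3, K2, K1, optimize=True) / N

    U = np.linalg.svd(P21)[0][:, :m]           # first m left singular vectors
    b1 = U.T @ P1                              # b1   = U^T P1
    binf = np.linalg.pinv(P21.T @ U) @ P1      # binf = (P21^T U)^+ P1
    M = np.linalg.pinv(U.T @ P21)              # (U^T P21)^+
    B = np.einsum('rm,rxt->xmt', U, P321) @ M  # B[x] = (U^T P3x1)(U^T P21)^+
    return b1, binf, B, w

def estimated_density(params, grid, xs):
    """Estimated p(x_1, ..., x_t) = binf^T B(x_t) ... B(x_1) b1 / w^t."""
    b1, binf, B, w = params
    b = b1
    for x in xs:
        b = B[int(np.abs(grid - x).argmin())] @ b  # nearest grid point
    return float(binf @ b) / w ** len(xs)
```

Dividing by the bin width w^t at the end converts the recovered bin probabilities back into a density; the output of the spectral formula is invariant to the sign ambiguity in the SVD.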
Once we have the estimates b̂1, b̂∞, and B̂(x), the joint and predictive (conditional) densities can be estimated via (see Lemma 2):

p̂(x1:t) = b̂∞^T B̂(xt:1) b̂1,    p̂(x_{t+t':t+1} | x1:t) = b̂∞^T B̂(x_{t+t':t+1}) b̂t.    (4)

Here b̂t is the estimated internal state obtained by plugging in b̂1, b̂∞, B̂ in (2). Theoretically, these estimates can be negative, in which case they can be truncated to 0 without affecting the theoretical results in Section 5. However, in our experiments these estimates were never negative.

4.3 Implementation Details

C/Q-Matrix operations using Chebyshev polynomials: While our algorithm and analysis are conceptually well founded, the important practical challenge lies in the efficient computation of the many aforementioned operations on c/q-matrices. Fortunately, some very recent advances in the numerical analysis literature, specifically on computing with Chebyshev polynomials, have rendered the above algorithm practical [6, Ch. 3-4]. Due to space constraints, we provide only a summary. Chebyshev polynomials are a family of orthogonal polynomials on compact intervals, known to be excellent approximators of one-dimensional functions [19, 20]. A recent line of work [5, 8] has extended the Chebyshev technology to two-dimensional functions, enabling the mentioned operations and factorisations such as QR, LU and SVD [6, Sections 4.6-4.8] of continuous matrices to be carried out efficiently. The density estimates P̂1, P̂21, P̂321 are approximated by Chebyshev polynomials to within machine precision. Our implementation makes use of the Chebfun library [7] which provides an efficient implementation of the operations on continuous and quasi matrices.

Computation time: Representing the KDE estimates P̂1, P̂21, P̂321 using Chebfun was roughly linear in N and is the brunt of the computational effort. The bandwidths for the three KDE estimates are chosen via cross validation which takes O(N²) effort.
However, in practice the cost was dominated by the Chebyshev polynomial approximation. In our experiments we found that NP-HMM-SPEC runs in linear time in practice and was more efficient than most alternatives.

Training with longer sequences: When training with longer sequences we can use a sliding window of length 3 across the sequence to create the triples of observations needed for the algorithm. That is, given N samples each of length ℓ(j), j = 1, . . . , N, we create an augmented dataset of triples {{(X_t^(j), X_{t+1}^(j), X_{t+2}^(j))}_{t=1}^{ℓ(j)-2}}_{j=1}^N and run NP-HMM-SPEC with the augmented data. As with conventional EM procedures, this requires the additional assumption that the initial state is the stationary distribution of the transition matrix T.

5 Analysis

We now state our assumptions and main theoretical results. Following [2, 4, 15] we assume i.i.d. sequences of triples are used for training. With longer sequences, the analysis need only be modified to account for the mixing of the latent state Markov chain, which is inessential for the main intuitions. We begin with the following regularity condition on the HMM.

Assumption 3. π > 0 element-wise. T ∈ R^{m×m} and O ∈ R^{[0,1]×m} are of rank m.

The rank condition on O means that the emission pdfs are linearly independent. If either T or O is rank deficient, then the learner may confuse state outputs, which makes learning difficult³. Next, while we make no parametric assumptions on the emissions, some smoothness conditions are used to make density estimation tractable. We use the Hölder class, H1(β, L), which is standard in the nonparametrics literature. For β = 1, this assumption reduces to L-Lipschitz continuity.

Assumption 4. All emission densities belong to the Hölder class, H1(β, L). That is, they satisfy,
That is, they satisfy,\n\nfor all \u21b5 \uf8ff bc, j 2 [m], s, t 2 [0, 1]\n\nd\u21b5Oj(s)\n\nds\u21b5 \n\nHere bc is the largest integer strictly less than .\n3 Siddiqi et al. [4] show that the discrete spectral algorithm works under a slightly more general setting.\nSimilar results hold for the nonparametric case too but will restrict ourselves to the full rank setting for simplicity.\n\nd\u21b5Oj(t)\n\ndt\u21b5\n\n \uf8ff L|s t||\u21b5|.\n\n\n\n5\n\n\fUnder the above assumptions we bound the total variation distance between the true and the estimated\ndensities of a sequence, x1:t. Let \uf8ff(O) = 1(O)/m(O) denote the condition number of the\nobservation qmatrix. The following theorem states our main result.\nTheorem 5. Pick any suf\ufb01ciently small \u270f> 0 and a failure probability 2 (0, 1). Let t 1. Assume\nthat the HMM satis\ufb01es Assumptions 3 and 4 and the number of samples N satis\ufb01es,\n\nN\nlog(N ) C m1+ 3\n\n2\n\n\uf8ff(O)2+ 3\n\n\n\nm(P21)4+ 4\n\n\n\n \u2713 t\n\u270f\u25c62+ 3\n\nlog\u2713 1\n\n\u25c61+ 3\n\n2\n\n.\n\nThen, with probability at least 1 , the estimated joint density for a t-length sequence satis\ufb01es\n\nR |p(x1:t) bp(x1:t)|dx1:t \uf8ff \u270f. Here, C is a constant depending on and L andbp is from (4).\n\nSynopsis: Observe that the sample complexity depends critically on the conditioning of O and P21.\nThe closer they are to being singular, the more samples is needed to distinguish different states and\nlearn the HMM. It is instructive to compare the results above with the discrete case result of Hsu et al.\n[2], whose sample complexity bound4 is N & m \uf8ff(O)2\nt2\n . Our bound is different in two\n\u270f2 log 1\nm(P21)4\nregards. First, the exponents are worsened by additional \u21e0 1\n terms. This characterizes the dif\ufb01culty\nof the problem in the nonparametric setting. 
While we do not have any lower bounds, given the current understanding of the difficulty of various nonparametric tasks [21-23], we think our bound might be unimprovable. As the smoothness of the densities increases, β → ∞, we approach the parametric sample complexity. The second difference is the additional log(N) term on the left hand side. This is due to the fact that we want the KDE to concentrate around its expectation in L2 over [0, 1], instead of just point-wise. It is not clear to us whether the log can be avoided.

To prove Theorem 5, first we will derive some perturbation theory results for c/q-matrices; we will need them to bound the deviation of the singular values and vectors when we use P̂21 instead of P21. Some of these perturbation theory results for continuous linear algebra are new and might be of independent interest. Next, we establish a concentration result for the kernel density estimator.

5.1 Some Perturbation Theory Results for C/Q-matrices

The first result is an analogue of Weyl's theorem which bounds the difference in the singular values in terms of the operator norm of the perturbation. Weyl's theorem has been studied for general operators [24] and cmatrices [6]. We have given one version in Lemma 21 of Appendix B. In addition to this, we will also need to bound the difference in the singular vectors and the pseudo-inverses of the truth and the estimate. To our knowledge, these results are not yet known. To that end, we establish the following results. Here σk(A) denotes the kth singular value of a c/q-matrix A.

Lemma 6 (Simplified Wedin's Sine Theorem for Cmatrices). Let A, Ã, E ∈ R^{[0,1]×[0,1]} where Ã = A + E and rank(A) = m. Let U, Ũ ∈ R^{[a,b]×m} be the first m left singular vectors of A and Ã, respectively. Then, for all x ∈ R^m, ‖Ũ^T U x‖2 ≥ ‖x‖2 √(1 - 2‖E‖²_{L2}/σm(Ã)²).

Lemma 7 (Pseudo-inverse Theorem for Qmatrices).
Let A, Ã, E ∈ R^{[a,b]×m} and Ã = A + E. Then,

σ1(A† - Ã†) ≤ 3 max{σ1(A†)², σ1(Ã†)²} σ1(E).

5.2 Concentration Bound for the Kernel Density Estimator

Next, we bound the error for kernel density estimation. To obtain the best rates under Hölderian assumptions on O, the kernels used in the KDE need to be of order β. An order-β kernel satisfies

∫_0^1 K(s) ds = 1,    ∫_0^1 s^α K(s) ds = 0 for all α ≤ ⌊β⌋,    ∫_0^1 s^β K(s) ds ≤ 1.    (5)

Such kernels can be constructed using Legendre polynomials [18]. Given N i.i.d. samples from a d-dimensional density f, where d ∈ {1, 2, 3} and f ∈ {P1, P21, P321}, for appropriate choices of the bandwidths h1, h21, h321, the KDE f̂ ∈ {P̂1, P̂21, P̂321} concentrates around f. Informally, we show

P(‖f̂ - f‖_{L2} > ε) ≲ exp(-log(N)^{d/(2+d)} N^{2/(2+d)} ε²)    (6)

for all sufficiently small ε and N/log N ≳ ε^{-(2+d/β)}. Here ≲, ≳ denote inequalities ignoring constants. See Appendix C for a formal statement. Note that when the observations are either discrete or parametric, it is possible to estimate the distribution using O(1/ε²) samples to achieve ε error in a suitable metric, say, using the maximum likelihood estimate. However, the nonparametric setting is inherently more difficult and therefore the rate of convergence is slower. This slow convergence is also observed in similar concentration bounds for the KDE [25, 26].

⁴ Hsu et al. [2] provide a more refined bound but we use this form to simplify the comparison.

Figure 1: The upper and lower panels correspond to m = 4 and m = 8 respectively. All figures are in log-log scale and the x-axis is the number of triples used for training. Left: L1 error between the true conditional density p(x6|x1:5) and the estimate for each method. Middle: The absolute error between the true observation and a one-step-ahead prediction. The error of the true model is denoted by a black dashed line. Right: Training time.

A note on the Proofs: For Lemmas 6 and 7 we follow the matrix proofs in Stewart and Sun [27] and derive several intermediate results for c/q-matrices in the process. The main challenge is that several properties of matrices, e.g. the CS and Schur decompositions, are not known for c/q-matrices. In addition, dealing with various notions of convergence for these infinite objects can be finicky. The main challenge with the KDE concentration result is that we want an L2 bound, so the usual techniques (such as McDiarmid's [13, 18]) do not apply.
We use a technical lemma from Giné and Guillou [26] which allows us to bound the L2 error in terms of the VC characteristics of the class of functions induced by an i.i.d. sum of the kernel. The proof of Theorem 5 essentially mimics the discrete case analysis of Hsu et al. [2]. While some care is needed (e.g. ‖x‖_{L2} ≤ ‖x‖_{L1} does not hold for functional norms), the key ideas carry through once we apply Lemmas 21, 6, 7 and (6). A more refined bound on N that is tighter in polylog(N) terms is possible; see Corollary 25 and equation 13 in the appendix.

6 Experiments

We compare NP-HMM-SPEC to the following. MG-HMM: An HMM trained using EM with the emissions modeled as a mixture of Gaussians. We tried 2, 4 and 8 mixtures and report the best result. NP-HMM-BIN: A naive baseline where we bin the space into n intervals and use the discrete spectral algorithm [2] with n states. We tried several values for n and report the best. NP-HMM-EM: The nonparametric EM heuristic of [12]. NP-HMM-HSE: The Hilbert space embedding method of [15].

Synthetic Datasets: We first performed a series of experiments on synthetic data where the true distribution is known. The goal is to evaluate the estimated models against the true model. We generated triples from two HMMs with m = 4 and m = 8 states and nonparametric emissions. The details of the set up are given in Appendix E. Figure 1 presents the results.

First we compare the methods on estimating the one step ahead conditional density p(x6|x1:5). We report the L1 error between the true and estimated models. In Figure 2 we visualise the estimated one step ahead conditional densities. NP-HMM-SPEC outperforms all methods on this metric. Next, we compare the methods on prediction performance. That is, we sample sequences of length 6 and test how well a learned model can predict x6 conditioned on x1:5. When comparing on squared error, the best predictor is the mean of the distribution.
Figure 2: True and estimated one-step-ahead densities p(x4|x1:3) for each model. Here m = 4 and N = 10^4.

Table 1: The mean prediction error and the standard error on the 3 real datasets.

Dataset            MG-HMM           NP-HMM-BIN       NP-HMM-HSE         NP-HMM-SPEC
Internet Traffic   0.143 ± 0.001    0.188 ± 0.004    0.0282 ± 0.0003    0.016 ± 0.0002
Laser Gen          0.33 ± 0.018     0.31 ± 0.017     0.19 ± 0.012       0.15 ± 0.018
Patient Sleep      0.330 ± 0.002    0.38 ± 0.011     0.197 ± 0.001      0.225 ± 0.001

For all methods we use the mean of p̂(x6|x1:5), except for NP-HMM-HSE, for which we used the mode since the mean cannot be computed. No method can do better than the true model (shown via the dotted line) in expectation. NP-HMM-SPEC achieves the performance of the true model with large datasets. Finally, we compare the training times of all methods. NP-HMM-SPEC is orders of magnitude faster than NP-HMM-HSE and NP-HMM-EM. Note that the error of MG-HMM, a parametric model, stops decreasing even with large data. This is due to the bias introduced by the parametric assumption. We do not train NP-HMM-EM for longer sequences because it is too slow. A limitation of the NP-HMM-HSE method is that it cannot recover conditional probabilities, so we exclude it from that experiment.
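Once a conditional density estimate is available on a grid, the mean-versus-mode choice above is a one-liner each. A minimal sketch with a made-up skewed density standing in for p̂(x6|x1:5):

```python
import numpy as np

# Hypothetical estimated conditional density on a grid; the two bumps
# make the mean (squared-error-optimal) and the mode differ.
grid = np.linspace(-1.0, 1.0, 2001)
dx = grid[1] - grid[0]
dens = 0.7 * np.exp(-0.5 * ((grid + 0.4) / 0.1) ** 2) \
     + 0.3 * np.exp(-0.5 * ((grid - 0.5) / 0.1) ** 2)
dens /= dens.sum() * dx               # normalise so the density integrates to 1

mean = (grid * dens).sum() * dx       # predictor minimising expected squared error
mode = grid[np.argmax(dens)]          # fallback when the mean is unavailable
```

Here the mean sits between the bumps (about -0.13) while the mode sits on the taller bump (about -0.4), which is why the two predictors can behave quite differently.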
We could not include the method of [4] in our comparisons since their code was not available and their method is not straightforward to implement. Further, their method cannot compute joint/predictive probabilities.

Real Datasets: We compare all the above methods (except NP-HMM-EM, which was too slow) on prediction error on 3 real datasets: internet traffic [28], laser generation [29] and sleep data [30]. The details on these datasets are in Appendix E. For all methods we used the mode of the conditional distribution p(x_{t+1}|x_{1:t}) as the prediction, as it performed better. For NP-HMM-SPEC, NP-HMM-HSE and NP-HMM-BIN we follow the procedure outlined in Section 4.3 to create triples and train with the triples. In Table 1 we report the mean prediction error and the standard error. NP-HMM-HSE and NP-HMM-SPEC perform better than the other two methods. However, NP-HMM-SPEC was faster to train (and has other attractive properties) when compared to NP-HMM-HSE.

7 Conclusion
We proposed and studied a method for estimating the observable representation of a hidden Markov model whose emission probabilities are smooth nonparametric densities. We derived a bound on the sample complexity of our method. While our algorithm is similar to existing methods for discrete models, many of the ideas that generalise it to the nonparametric setting are new. In comparison to other methods, the proposed approach has some desirable characteristics: we can recover the joint/conditional densities, our theoretical results are in terms of more interpretable metrics, and the method outperforms baselines and is orders of magnitude faster to train.

In this exposition we focused only on one-dimensional observations. The multidimensional case is handled by extending the above ideas and technology to multivariate functions. Our algorithm and the analysis carry through to the d-dimensional setting, mutatis mutandis. The concern, however, is more practical.
While we have the technology to perform various c/q-matrix operations for d = 1 using Chebyshev polynomials, this is not yet the case for d > 1. Developing efficient procedures for these operations in high-dimensional settings is a challenge for the numerical analysis community and is beyond the scope of this paper. That said, some recent advances in this direction are promising [8, 31]. While our method has focused on HMMs, the ideas in this paper apply to a much broader class of problems. Recent advances in spectral methods for estimating parametric predictive state representations [32], mixture models [3] and other latent variable models [33] can be generalised to the nonparametric setting using our ideas. Going forward, we wish to focus on such models.

Acknowledgements
The authors would like to thank Alex Townsend, Arthur Gretton, Ahmed Hefny, Yaoliang Yu, and Renato Negrinho for the helpful discussions. This work was supported by NIH R01GM114311 and AFRL/DARPA FA87501220324.

References
[1] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 1989.
[2] Daniel J. Hsu, Sham M. Kakade, and Tong Zhang. A Spectral Algorithm for Learning Hidden Markov Models. In COLT, 2009.
[3] Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. A Method of Moments for Mixture Models and Hidden Markov Models. arXiv preprint arXiv:1203.0683, 2012.
[4] Sajid M. Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-Rank Hidden Markov Models. In AISTATS, 2010.
[5] Alex Townsend and Lloyd N. Trefethen. Continuous analogues of matrix factorizations. Proc. R. Soc. A, 2015.
[6] Alex Townsend. Computing with Functions in Two Dimensions. PhD thesis, University of Oxford, 2014.
[7] Tobin A. Driscoll, Nicholas Hale, and Lloyd N. Trefethen. Chebfun Guide. Pafnuty Publications, 2014.
[8] Alex Townsend and Lloyd N. Trefethen.
An extension of Chebfun to two dimensions. SIAM Journal on Scientific Computing, 2013.
[9] Michael L. Littman, Richard S. Sutton, and Satinder P. Singh. Predictive representations of state. In NIPS, volume 14, pages 1555–1561, 2001.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977.
[11] Lloyd R. Welch. Hidden Markov models and the Baum-Welch algorithm. IEEE Information Theory Society Newsletter, 2003.
[12] Tatiana Benaglia, Didier Chauveau, and David R. Hunter. An EM-like algorithm for semi- and nonparametric estimation in multivariate mixtures. Journal of Computational and Graphical Statistics, 18(2):505–526, 2009.
[13] Larry Wasserman. All of Nonparametric Statistics. Springer-Verlag New York, 2006.
[14] Yohann De Castro, Élisabeth Gassiat, and Claire Lacour. Minimax adaptive estimation of nonparametric hidden Markov models. arXiv preprint arXiv:1501.04787, 2015.
[15] Le Song, Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon, and Alex Smola. Hilbert space embeddings of hidden Markov models. In ICML, 2010.
[16] Le Song, Animashree Anandkumar, Bo Dai, and Bo Xie. Nonparametric Estimation of Multi-View Latent Variable Models. Pages 640–648, 2014.
[17] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 2000.
[18] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
[19] L. Fox and I. B. Parker. Chebyshev Polynomials in Numerical Analysis. Oxford University Press, 1968.
[20] Lloyd N. Trefethen. Approximation Theory and Approximation Practice. Society for Industrial and Applied Mathematics, 2012.
[21] Lucien Birgé and Pascal Massart. Estimation of integral functionals of a density. Annals of Statistics, 1995.
[22] James Robins, Lingling Li, Eric Tchetgen, and Aad W. van der Vaart.
Quadratic semiparametric Von Mises calculus. Metrika, 69(2-3):227–247, 2009.
[23] Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabás Póczos, Larry Wasserman, and James Robins. Nonparametric Von Mises Estimators for Entropies, Divergences and Mutual Informations. In NIPS, 2015.
[24] Woo Young Lee. Weyl's theorem for operator matrices. Integral Equations and Operator Theory, 1998.
[25] Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John D. Lafferty, and Larry A. Wasserman. Forest Density Estimation. Journal of Machine Learning Research, 12:907–951, 2011.
[26] Evarist Giné and Armelle Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. Annales de l'IHP Probabilités et statistiques, 2002.
[27] G. W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, 1990.
[28] Vern Paxson and Sally Floyd. Wide area traffic: the failure of Poisson modeling. IEEE/ACM Transactions on Networking, 1995.
[29] U. Hübner, N. B. Abraham, and C. O. Weiss. Dimensions and entropies of chaotic intensity pulsations in a single-mode far-infrared NH3 laser. Physical Review A, 1989.
[30] Santa Fe Time Series Competition. http://www-psych.stanford.edu/~andreas/Time-Series/SantaFe.html.
[31] B. Hashemi and L. N. Trefethen. Chebfun to three dimensions. In preparation, 2016.
[32] Satinder Singh, Michael R. James, and Matthew R. Rudary. Predictive State Representations: A New Theory for Modeling Dynamical Systems. In UAI, 2004.
[33] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor Decompositions for Learning Latent Variable Models. JMLR, 2014.