Paper ID: | 1444 |
---|---|

Title: | Learning HMMs with Nonparametric Emissions via Spectral Decompositions of Continuous Matrices |

This paper presents an extension of the spectral algorithm for learning HMM by Hsu et al. for the case with continuous non-parametric emission distributions. The derivation and analysis of the algorithm follow closely the arguments for the case with discrete observations, but under the surface novel perturbation bounds for spectral properties of so-called continuous matrices are derived.

Combining spectral techniques with kernel density estimator to deal with non-parametric distributions is a very important endeavor. This paper presents an interesting approach to this problem that leverages recent developments in applied mathematics; in particular, the idea of approximating a KDE with Chebyshev polynomials to make computations faster seems particularly promising. The main weakness of the paper is the lack of comparison with previous approaches to this problem, eg. [1]. [1] L. Song, A. Anandkumar, B. Dai and B. Xie. Nonparametric estimation of multi-view latent variable models. International Conference on Machine Learning (ICML 2014)

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

This paper extends the spectral learning approach of [2] from discrete random observations to continuous random observations. The main challenges are twofold: algorithmic and analytical. The proposed spectral method for infinite dimensional space uses recent advances in [5] together with Chebyshev polynomial expansions. Analytically, the main challenge is in extending matrix perturbation bounds from linear algebra and random matrix theory to the functional space. The main technical contribution is in the generalization of matrix perturbation bounds for c/q-matrices, and putting them together to give a sample complexity for the HMM learning problem.

The definition of Holder class is standard, but the definition presented in this paper is different from a similar one in, e.g., [25] definition 3 or Definition 1 in "Nonparametric Estimation of Renyi Divergence and Friends" by Krishnamurthy et al. Is this difference a mistake/typo in either the above referenced papers or this manuscript? Or is there a fundamental reason for using a slightly different definitions in these line of research? This manuscript builds upon a several recent advances. Mainly, using Chebyshev polynomials for continuous linear algebra, and c/q-matrix perturbation bounds. However, an important step is missing, and this paper falls short of bridging the gap. The main reason one wishes to generalize methods from [2] on discrete random observations to continuous ones is to apply to high dimensional data. The difference in both the analysis and also the algorithm becomes prominent in higher dimensions, where interesting innovations are required. The main reason spectral methods are popular is also due to its algorithmic cleanness (which stems from the advances in numerical analyses) and simple analysis (which stems from matrix perturbation bounds). Although this paper tries to address these two challenges in continuous matrices, it falls short of giving a numerical method to solve the problem in dimensions larger than one. However, the solutions provided in this paper is an interesting first step in this direction, and it would benefit the readers significantly, if the authors can provide exactly what the technical difficulties are in high dimensional continuous matrix analyses. For example, can we not use kernels to approximate the top singular function as well?

2-Confident (read it all; understood it all reasonably well)

The paper discuss of a novel method of estimating emission pdfs for HMMs. Differently than other methods, they estimate the emissions using nonparametric pdf. The nonparametric pdfs are estimated by means of spectral decomposition. The main novelties of this paper are: estimation of nonparametric probability density functions for emissions, and the fact they used well-established linear algebra algorithm (eg SVD), using continuous linear algebra (quas- and continuos matrices).

To the best of my knowledge, what I could see is that the paper seems to be the outcome of a very good job from the author(s). Specifically, it is well written, mathematically sounds and experiments are convincing. I cannot say anything about this work. In order to improve it, I would suggest the following minor changes: - Sec 3, line 131, they define P1, P21 and P321. Moreover, they have x1, x2, and x3 as "observation at time t". Why do they have 3 observations? It's an odd enumeration that was not clear. - The name of the proposed method, HM-SPEC-HMM, should be reported earlier as well (eg abstract and introduction).

1-Less confident (might not have understood significant parts)

Please ignore this review.

Please ignore this review.

1-Less confident (might not have understood significant parts)

The paper proposes a learning approach for HMMs with non-parametric emission distributions that computes the observable representation with the help of kernel density estimates. An approximation of the kernel density estimates with Chebyshev polynomials improves the runtime of the approach and ensures its practical applicability.

Strengths and Weaknesses ======================== Positive aspects: + The paper is well written and has a clear and coherent structure. The discussion of related work is comprehensive. + Non-parametric emission distributions add flexibility to the general HMM framework and reduce bias due to wrong modeling assumptions. Progress in this area should have theoretical and practical impact. + The paper builds upon existing spectral methods for parametric HMMs but introduces novel techniques to extend those approaches to the non-parametric case. + The theoretical bounds (section 5) are interesting, even though most of the results are special cases or straightforward extensions of known results. Negative aspects: - The restriction to triplets (or a sliding window of length 3) is quite limiting. Is this a fundamental limitation of the approach or is an extension to longer subsequences (without a sliding window) straightforward? - Since the paper mentions the possibility to use Chebyshev polynomials to achieve a speed-up, it would have been interesting to see a runtime comparison at test time. - The presentation is at times too equation-driven and the notation, especially in chapter 3, quite convoluted and hard to follow. An illustrative figure of the key concepts in section 3 would have been helpful. - The experimental evaluation compares the proposed approach to 4 other HMM baselines. Even though NP-SPEC-HMM outperforms those baselines, the experimental evaluation has only toy character (simple length 6 conditional distributions, only one training/test sequence in case of the real datasets). Minor points ============ * l.22: Only few parametric distributions allow for tractable exact inference in an HMM. * l.183: Much faster approximations than Chebyshev polynomials exist for the evaluation of kernel density estimates, especially in low-dimensional spaces (e.g., based on fast multipole methods). * Figure 1: There is probably a ‘x 10^3’ missing in the plot on the bottom right. Questions ========= * Eq. (3): Why the restriction to an isotropic bandwidth and a product kernel? Especially a diagonal bandwidth matrix could have been helpful. Would the approximation with Chebyshev polynomials still work? * The paper focuses on learning HMMs with non-parametric emission distributions, but it does not become clear how those emission distributions affect inference. Which of the common inference tasks in a discrete HMM (filtering, smoothing, marginal observation likelihood) can be computed exactly/approximately with an NP-SPEC-HMM? * Is it computationally feasible to use the proposed model in a more realistic application, e.g., action recognition from motion capture sequences? Conclusion ========== The paper is well written and, from a theoretical perspective, interesting to read. However, the experiments are weak and it remains unclear how practical the proposed model would be in real applications. I’m tending towards accept, but the authors should comment in their rebuttal on the above points.

2-Confident (read it all; understood it all reasonably well)