
Submitted by Assigned_Reviewer_12
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a novel distance and inner product on the space of positive, regularized self-adjoint operators. By considering the correspondence between a Hilbert space and this space through the log/exp maps, the authors introduce a geometrically natural inner product on it. The distance can be regarded as an infinite-dimensional generalization of the log-Euclidean distance defined for finite-dimensional positive definite matrices. The proposed distance is used to define positive definite kernels on kernel covariance representations of images, and shows better performance on some image classification examples.
The paper is very clearly written and easy to read, at least for those with a solid math background. The ideas and the meaning of the notions are clearly explained. The mathematics also appears clean.
While the experimental results are favorable for the proposed log-HS method, I think the experimental comparison is a bit weak: only three data sets are used, all of them image data sets. Experiments with data sets of various types and sizes would make the paper stronger. Also, the comparison is limited to covariance-based methods. Including other approaches would be preferable. For example, within kernel methods, SVMs on kernel mean representations of data distributions (Muandet et al. 2012) have been used for images. For expressing the higher-order statistics in data, the mean of the feature vector may carry the same amount of information as the covariance on the feature space.
It would be better to include some discussion of computational cost, comparing the cost of log-HS with that of the affine-invariant metric and divergence approaches.
References: K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schoelkopf. (2012) Learning from Distributions via Support Measure Machines. Advances in NIPS 25, pp. 10-18.
Q2: Please summarize your review in 1-2 sentences
This paper proposes a novel inner product and kernels based on the geometry of the space of positive, regularized self-adjoint operators. The paper is theoretically sound, and the proposed method also has practical potential.
Submitted by Assigned_Reviewer_17
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
This paper extends the idea of the Log-Euclidean metric (LEM) on the Riemannian manifold of positive definite matrices to an infinite-dimensional setting. In this work, the proposed Log-Hilbert-Schmidt metric (LHSM) provides a way of computing distances between positive definite operators on a Hilbert space. From a theoretical perspective, extending the LEM to an infinite-dimensional space is not a trivial task, because one cannot obtain the infinite-dimensional counterpart of the LEM by simply letting the dimension approach infinity.
Two fundamental problems are addressed in this paper. First, if A is a self-adjoint operator on an infinite-dimensional Hilbert space, log(A), which is an essential ingredient of the Log-Euclidean metric, is not well-defined and may be unbounded. Second, the identity operator on this space is not Hilbert-Schmidt and would have infinite distance from any Hilbert-Schmidt operator. The paper gives rigorous solutions to both problems. Moreover, when the covariance operator is defined on an RKHS induced by a positive definite kernel, the paper gives explicit formulas for the proposed metric that can be evaluated entirely via the corresponding Gram matrices.
From an application point of view, the proposed metric is applied to multi-category image classification, in which each image is represented by an infinite-dimensional RKHS covariance operator constructed from "original features" in the input space. The proposed metric is found to outperform the state-of-the-art algorithm, which only considers the finite-dimensional covariance matrix in the input space. However, it would be more interesting to also compare this method to other image classification approaches, e.g., those based on deep learning.
The idea presented in this paper makes sense intuitively and will be very useful in practice, because one can view this framework as a nonlinear extension of learning problems in which the samples are represented by covariance matrices. In addition to images, I believe one can broadly apply this metric to other kinds of data that have high-order correlations. Moreover, defining the metric in the RKHS allows even more flexibility, as one can freely choose different kernel functions.
Quality:
The paper is of high quality. The claims and proofs are presented in great detail.
Clarity:
The paper is clearly written and I am able to follow the logic.
Originality:
The novelty of this work is the rigorous formulation of an infinite-dimensional version of the Log-Euclidean metric. The problem is not trivial, and I think the contribution of this work in terms of technical results is quite substantial.
Significance:
The proposed metric is shown to outperform state-of-the-art algorithms, so the contribution of this work should be significant in practice. However, the authors did not compare the different approaches in terms of computational complexity, so it is difficult to draw conclusions on that aspect.
Some minor comments:
- If I understand correctly, the covariance operator is defined for each sample, and is therefore fundamentally different from the standard way the covariance operator is defined, such as in kernel PCA. There, the covariance operator is constructed from a sample, and as the sample size goes to infinity we obtain the population counterpart for the underlying distribution. In this work, on the other hand, each image is represented by a covariance operator. Do you have an intuitive explanation of what the population counterpart of this covariance operator is? Are they fundamentally the same or different?
- In Section 6, what are the "original features" extracted from the image, and how exactly are these features generated?
Q2: Please summarize your review in 1-2 sentences
This work proposes a nontrivial extension of the Log-Euclidean metric to infinite-dimensional spaces, e.g., an RKHS with a Gaussian RBF kernel. The contribution of this paper is substantial in terms of both theoretical and empirical results.
Submitted by Assigned_Reviewer_42
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper introduces the Log-Hilbert-Schmidt metric for positive definite unitized operators. This allows measuring distances between infinite-dimensional operators (e.g. covariance operators). The paper is superbly written, presents the complex content in a clear manner, and demonstrates strong results on image dataset classification. The Log-Hilbert-Schmidt metric is a strong contribution to kernel methods.
The paper is based on two contributions. First, a Log-Hilbert-Schmidt metric generalizing the Log-Euclidean metric to an RKHS. The metric is elaborated in the case of regularized covariance operators. The operators themselves are defined to be unitized HS operators.
The paper is of high quality, very clear, original and significant.
Specific comments:
- Isn't the \otimes on row 175 a function of w?
- Should the \gamma in Eq. 9 be nonnegative?
- Given that Eq. 8 is infinite when A = I, wouldn't Eq. 11 also be infinite in the same case?
- How is the eHS norm defined over a matrix log (e.g. Eq. 13)?
Q2: Please summarize your review in 1-2 sentences
A superb paper introducing the Log-Hilbert-Schmidt metric between operators, e.g. covariance operators on Hilbert spaces, with excellent results.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank all three reviewers for the constructive and generally positive comments on our paper.
We will address all of your comments either in the final version of the current paper or in a longer version, where we will report more experimental results and further properties of the log-HS metric.
Regarding the experiments:
With the current experimental results in the paper, we aim to show the substantial performance gain of the log-HS metric over the finite-dimensional methods. We agree with both reviewers AR12 and AR17 that it would be very interesting to try our framework on other datasets and to compare our results with other approaches not based on covariance matrices. We are currently working on this and will report further results either in the Supplementary Material or in the longer version of the paper.
Regarding the computational complexity:
We will discuss this in more detail in the final version of the paper. Briefly, let n be the number of features and m be the number of observations for each sample. Computing the log-Euclidean metric between two n x n covariance matrices takes time O(n^2 m + n^3), since it takes O(n^2 m) time to compute the matrices themselves and the SVD of an n x n matrix takes time O(n^3). On the other hand, computing the log-HS metric between two infinite-dimensional covariance operators in RKHS, where we deal with Gram matrices of size m x m, takes time O(m^3). This is of the same order as the computational complexity of computing the Stein and Jeffreys divergences in RKHS in [10] (Harandi et al., CVPR 2014). We are presently working on numerical methods to make this amenable to large-scale datasets.
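The finite-dimensional cost quoted above can be illustrated with a minimal sketch. It assumes the standard log-Euclidean distance ||log(A) - log(B)||_F, uses a symmetric eigendecomposition in place of the SVD mentioned above (also O(n^3)), and adds a small ridge `eps` so that the logarithm is defined. The function names and the value of `eps` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def regularized_covariance(X, eps=1e-6):
    """Covariance of X (n features x m observations) plus eps * I.

    Cost is O(n^2 m). The ridge eps is an assumption made here so that
    the matrix is strictly positive definite.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    n, m = X.shape
    return Xc @ Xc.T / m + eps * np.eye(n)

def spd_log(C):
    """Matrix logarithm of a symmetric positive definite matrix, O(n^3)."""
    w, V = np.linalg.eigh(C)          # C = V diag(w) V^T
    return (V * np.log(w)) @ V.T      # V diag(log w) V^T

def log_euclidean_distance(A, B):
    """Frobenius norm of log(A) - log(B)."""
    return np.linalg.norm(spd_log(A) - spd_log(B), ord="fro")
```

For two samples stored as n x m arrays `X` and `Y`, the distance would be `log_euclidean_distance(regularized_covariance(X), regularized_covariance(Y))`. The log-HS computation described above works with m x m Gram matrices instead and is not reproduced here.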
To Reviewer AR12:
We agree with the reviewer that it would be interesting to perform experiments with more data types and also with approaches such as the Support Measure Machine (SMM) of Muandet et al. (NIPS 2012). We note that while the mean vector in the feature space may theoretically characterize a distribution, in practice, when we only have finite samples, it can be difficult to separate two distributions based on their mean vectors. This has been studied for example in Sriperumbudur et al, Hilbert Space Embeddings and Metrics on Probability Measures (JMLR 2010). We will add a discussion of this point in the final version of the paper. Thus the performance of the SMM will also depend on the particular application at hand. Experimental results comparing the approach of Muandet et al. (NIPS 2012) and our framework will be reported either in the Supplementary Material or in the longer version of the paper.
To Reviewer AR17:
- We agree with the reviewer that it would be interesting to compare our approach with other image classification approaches, e.g. those based on deep learning. We note that our framework can be viewed as a kernel machine with 2 nonlinear layers and we will add a short discussion of this aspect in the final version of the paper.
- The covariance operator: As in standard covariance-based learning, a covariance operator is defined for each sample, which is assumed to be generated by its own probability distribution. As such, it is fundamentally different from that of Kernel PCA. In our setting, covariance operators for samples belonging to the same class should be close to each other under the log-HS metric, and those corresponding to samples belonging to different classes should be far apart. Thus, we currently do not consider a covariance operator for all samples. However, after computing the log-HS distances between the samples, we can define a new kernel using these distances, with a corresponding covariance operator for all samples. This way, we can perform Kernel PCA (or any kernel method for that matter) for all the samples, just like we currently perform SVM on top of the log-HS metric computation, giving a 2-layer kernel machine. We will include a discussion of this aspect in the final version of the paper and perform further numerical experiments, which will be reported in the longer version of the paper.
- Section 6: the "original features" for an image are, for example, the intensity and gradients at each pixel in the image. In general, they are low-level features extracted from the pixel values in the image. We describe how they are extracted for the Texture dataset in lines 393-397.
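The second layer described in the covariance operator item above can be sketched as follows. The sketch assumes a precomputed N x N matrix `D` of pairwise distances between samples and a Gaussian kernel of the form exp(-d^2 / sigma^2). The kernel form, the bandwidth `sigma`, and the function names are illustrative assumptions rather than the exact choices in the paper.

```python
import numpy as np

def kernel_from_distances(D, sigma=1.0):
    """Gaussian kernel matrix built from pairwise distances D (N x N)."""
    D = np.asarray(D, dtype=float)
    return np.exp(-(D ** 2) / (sigma ** 2))

def kernel_pca(K, n_components=2):
    """Kernel PCA embedding of the N training samples from a kernel matrix K."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N       # centering matrix
    Kc = H @ K @ H                            # kernel centered in feature space
    w, V = np.linalg.eigh(Kc)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:n_components]  # keep the largest eigenvalues
    w, V = w[idx], V[:, idx]
    # Projection of sample i on component k is sqrt(w_k) * V[i, k].
    return V * np.sqrt(np.clip(w, 0.0, None))
```

The same kernel matrix could instead be passed to an SVM that accepts precomputed kernels, which corresponds to the classification setting described above.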
To Reviewer AR42:
- Line 175: The \otimes here denotes the outer product, which is defined by the second expression on this line.
- Equation (9): gamma is not required to be nonnegative in the definition of the extended Hilbert-Schmidt algebra. Later, with the additional hypothesis that (A + gamma I) is positive definite, gamma is automatically positive if dim(H) = infinity. However, if dim(H) < infinity, gamma can be negative, as long as (A + gamma I) has a positive spectrum. We will add a discussion of this aspect in the final version of the paper. Note that we explicitly require gamma > 0 from Section 4.2 onwards, when we focus on regularized positive operators, which form a subset of the manifold of positive definite operators.
- It is true that when A = I, Eq. (8) is infinite, since it refers to the HS norm. However, Eq. (11) is finite when A is an HS operator, since it refers to the extended HS norm. We will make it clearer in the final version that the operator A in Eq. (11) is HS, as defined in Eq. (9).
- Eq. (13): it can be shown that the log expression on the right-hand side has the form (C + z I) for an HS operator C and a real number z, so that Eq. (11) can be applied to give the expression for the eHS norm. We will make this clearer in the final version of the paper.
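The distinction between the HS norm and the extended HS norm in the last two items can be written out as a short worked formula. This is a sketch that assumes the extended Hilbert-Schmidt inner product takes the form usually given for unitized operators A + gamma I with A Hilbert-Schmidt; the authoritative statement is Eq. (11) of the paper.

```latex
\langle A + \gamma I,\; B + \mu I \rangle_{\mathrm{eHS}}
  = \langle A, B \rangle_{\mathrm{HS}} + \gamma\mu,
\qquad
\| A + \gamma I \|_{\mathrm{eHS}}^{2}
  = \| A \|_{\mathrm{HS}}^{2} + \gamma^{2}.
```

Under this assumption the identity operator has extended HS norm 1 (take A = 0 and gamma = 1), whereas its HS norm is infinite when dim(H) = infinity. The same formula applies to an operator of the form C + z I with C Hilbert-Schmidt and z real.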
 