Paper ID: 14
Title: The Randomized Dependence Coefficient
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper introduces a new method called RDC to measure the statistical dependence between random variables. It combines a copula transform with a variant of kernel CCA based on random projections, resulting in O(n log n) complexity. Experiments on synthetic and real benchmark data show promising results for feature selection.
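For concreteness, here is a minimal sketch of the pipeline as I understand it (function names, the sine feature map and the bandwidth parameter s are my own illustrative choices, not necessarily the authors'):

```python
import numpy as np
from scipy.stats import rankdata

def rdc(x, y, k=10, s=1.0 / 6.0, seed=0):
    """Sketch of the RDC pipeline: copula transform, random sinusoidal
    features, then the largest canonical correlation."""
    rng = np.random.default_rng(seed)
    n = len(x)

    # 1) Copula transform: replace each value by its empirical CDF value.
    u = rankdata(x) / n
    v = rankdata(y) / n

    # 2) Random non-linear projections of the copula-transformed data.
    def feats(z):
        zb = np.column_stack([z, np.ones(n)])           # add a bias term
        w = rng.normal(scale=s, size=(zb.shape[1], k))  # random frequencies
        return np.sin(zb @ w)

    fu, fv = feats(u), feats(v)

    # 3) Canonical correlation analysis on the two feature sets.
    c = np.corrcoef(np.column_stack([fu, fv]), rowvar=False)
    cxx, cyy, cxy = c[:k, :k], c[k:, k:], c[:k, k:]
    m = np.linalg.pinv(cxx) @ cxy @ np.linalg.pinv(cyy) @ cxy.T
    rho2 = np.max(np.real(np.linalg.eigvals(m)))
    return float(np.sqrt(np.clip(rho2, 0.0, 1.0)))
```

For fixed k, the sorting inside the rank computation is the O(n log n) step; the feature map and the k-by-k eigenproblem are linear in n.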

The paper is overall clear and pleasant to read. The good experimental results and simplicity of implementation suggest that the proposed method may be a useful complement to existing methods.

The originality is limited, since it mostly combines several "tricks" that have been used in the past, namely the copula transformation to make the measure invariant to monotonic transformations (see e.g. Conover and Iman, The American Statistician, 35(3):124-129, 1981, or more recently the reference [15] cited by the authors), and the random projection trick to define a low-rank approximation of a kernel matrix.

Although the work is technically correct, the following points would require clarification:

- the authors insist that RDC is much faster to compute than kCCA, which they claim is O(n^3) and takes 166s for 1000 points in the experiments. I am surprised by this, since the original kCCA paper of Bach and Jordan proposes an O(n) implementation using an incomplete Cholesky decomposition, and the authors claim there that it takes 0.5 seconds on 1000 points (more than 10 years ago). In fact, the incomplete Cholesky decomposition of Bach is a very popular approach for running kernel methods on large numbers of samples, similar in spirit to the random projection trick used by the authors. A natural question is then: to perform kCCA with large n, is it really better to use the random projection trick rather than the incomplete Cholesky decomposition?
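For reference, a rough sketch of what a pivoted incomplete Cholesky factorisation of a kernel matrix does (my own illustrative code, not Bach and Jordan's implementation):

```python
import numpy as np

def incomplete_cholesky(X, kernel, rank, tol=1e-6):
    """Pivoted incomplete Cholesky: greedily pick the point with the largest
    residual diagonal, producing G with K ~= G @ G.T using only
    O(n * rank) kernel evaluations and O(n * rank) memory."""
    n = X.shape[0]
    G = np.zeros((n, rank))
    d = np.array([kernel(X[i], X[i]) for i in range(n)])  # residual diagonal
    for j in range(rank):
        i = int(np.argmax(d))
        if d[i] < tol:                    # residual small enough: stop early
            return G[:, :j]
        col = np.array([kernel(X[l], X[i]) for l in range(n)])
        G[:, j] = (col - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G

# Example with a Gaussian kernel on 1000 points, rank-20 approximation.
rbf = lambda a, b: float(np.exp(-0.5 * np.sum((a - b) ** 2)))
X = np.random.randn(1000, 2)
G = incomplete_cholesky(X, rbf, rank=20)   # K ~= G @ G.T
```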

- as pointed out by the authors, when the dimension k of the random projections gets large the method converges to unregularized kCCA, so it is important that k is not too large because "some" regularization of kCCA is needed. Hence k seems to play a crucial regularization role, akin to the regularization parameter in kCCA. Is there any theoretical or empirical argument in favor of regularizing via k, compared to the classical kCCA regularization? An argument against using k is that, when k is not too large, the randomness of the projections may lead to significant fluctuations in the final score, which is not a good property. In fact, although RDC fulfills all conditions in Table 1, one property it does not satisfy is determinism: RDC is a stochastic quantity (i.e., compute it twice and you get different values).
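This is easy to see with an illustrative implementation such as the sketch above (again, my code, not the authors'):

```python
import numpy as np

x = np.random.randn(500)
y = x ** 2 + 0.1 * np.random.randn(500)

# rdc refers to the illustrative sketch given earlier in this review.
# Two runs with different random projections give (slightly) different scores.
print(rdc(x, y, k=10, seed=1))
print(rdc(x, y, k=10, seed=2))
```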
Q2: Please summarize your review in 1-2 sentences
An interesting work combining several known ideas, but the comparison with its nearest cousin, kCCA, is not entirely fair or well studied.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The RDC is a non-linear dependence estimator that satisfies Renyi's criteria and exploits the very recent Fastfood speedup trick (ICML 2013). It is a straightforward recipe: 1) copularize the data, effectively preserving the dependence structure while ignoring the marginals, 2) sample k non-linear features of each datum (inspired by Bochner's theorem), and 3) solve the regular CCA eigenvalue problem on the resulting paired datasets. Ultimately, RDC feels like a copularised variation of kCCA (misleading as this may sound). Its efficiency is illustrated successfully on a set of classical non-linear bivariate dependency scenarios and on 12 real datasets via a forward feature-selection procedure.
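As a quick aside on step 2: Bochner's theorem says a shift-invariant kernel is the Fourier transform of a probability measure, so sampling frequencies from that measure yields an unbiased random feature map. A hedged sketch for the Gaussian kernel (names and parameters are mine):

```python
import numpy as np

def rff(X, W, b):
    """Random Fourier features: phi(x) = sqrt(2/k) * cos(W'x + b)."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

# For the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)),
# Bochner's theorem gives the spectral density N(0, sigma^(-2) I).
rng = np.random.default_rng(0)
n, d, k, sigma = 500, 3, 200, 1.0
X = rng.standard_normal((n, d))
W = rng.normal(scale=1.0 / sigma, size=(d, k))   # frequencies from the spectral density
b = rng.uniform(0.0, 2.0 * np.pi, size=k)        # random phases

Z = rff(X, W, b)
# Z @ Z.T is an unbiased estimate of the exact Gaussian kernel matrix on X.
```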

I found the paper very clear and easy to understand. Even though the idea simply puts together the right pieces, it remains a significant contribution to the literature.

Some notes:
- I do not like that k was simply 'set to 10'. It seems to play a more important role than the paper implies.
- line 74: what do you mean by "same structure as HGR"?
- eq.(5): m -> n
- line 217: which is independent -> by independence
Q2: Please summarize your review in 1-2 sentences
RDC is a straightforward and computationally efficient estimator of the HGR coefficient but the choice of k deserves more discussion.

I've read the author's rebuttal.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a non-linear measure of dependence between two random variables. It turns out to be the canonical correlation between random non-linear projections of the variables after a copula transformation, which makes the measure invariant to monotonic transformations of the marginals.

This work details a simple but seemingly powerful procedure for quantifying non-linear dependencies between random variables. Part of the key idea of the work is the use of non-linear random projections via the "random Fourier features" of Rahimi and Recht (2007). The authors compare their new randomised dependence coefficient (RDC) against a whole host of other coefficients for measuring both linear and non-linear dependence. They also consider the suitability of the RDC for variable screening in linear prediction problems. I think these are useful comparisons and experiments, and they allow the reader to get a decent feel for the behaviour of RDC.

The overall exposition of the paper is clear and each component of the procedure is clearly described. In this respect the schematic diagram in Figure 1 is particularly useful. Similarly, the comparison with other dependence coefficients, combined with the empirical results, is very illuminating and suggests that the performance of RDC is very promising.

Although the authors choose to use RFFs, comparisons have recently been made with other schemes for generating random features. Yang et al (2012), "Nystrom Method vs Random Fourier Features: A Theoretical and Empirical Comparison", flesh out more of the theoretical differences between RFF and Nystrom features. One difference in particular is the sampling scheme: Nystrom samples randomly in such a way that it takes the probability distribution of the data into account, and in this sense it achieves a better estimate of the underlying kernel. If this is an important property for the proposed RDC, perhaps looking at Nystrom features would be interesting. It would also be interesting to see the effect of different types of non-linear projections: whereas RFFs are limited to shift-invariant kernels (with some extensions), Nystrom features can be computed for any type of kernel function.
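To make the contrast concrete, a rough sketch of Nystrom features in their usual form (uniform landmark sampling; all names are my own, not a reference implementation):

```python
import numpy as np

def nystrom_features(X, kernel_matrix, m=50, seed=0):
    """Nystrom features: sample m landmarks from the data itself and map
    every point through K(., landmarks) @ K(landmarks, landmarks)^(-1/2).
    Unlike RFFs, this works for any positive semi-definite kernel."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    L = X[idx]                                 # landmarks drawn from the data
    K_nm = kernel_matrix(X, L)                 # shape (n, m)
    K_mm = kernel_matrix(L, L)                 # shape (m, m)
    vals, vecs = np.linalg.eigh(K_mm)          # inverse square root of K_mm
    vals = np.clip(vals, 1e-12, None)
    return K_nm @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)

def gaussian_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

Z = nystrom_features(np.random.randn(1000, 5), gaussian_kernel, m=50)
# Z @ Z.T approximates the full 1000 x 1000 kernel matrix, with the
# approximation quality tied to how well the landmarks cover the data.
```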

Also, Yang et al show empirically that there is a large improvement in predictive performance when more random features are used - it would be interesting to see what happens when k is increased.

I have some concerns about the theoretical treatment presented.
The discussion of the bound in (10) seems to skip over the tradeoff involving k. It seems that in order to drive the LC/sqrt(k) term to be small (which could be quite slow?), ||m||_F could grow very large compared with n, but this is hidden by referring to the dependency as O(1/sqrt(n)). As it stands I am not sure the analysis is sufficient to complement the good empirical performance.

With regards to the properties of the RFF approximation for linear prediction, one possibility is that the approximation acts as a regulariser. For this reason, I am not sure that the type of generalisation bound used for prediction algorithms is completely appropriate to quantify the performance and analyse the behaviour of a dependence measure.

One small point: Recht is incorrectly referred to as "Brecht" in the text.
===============

I have read the author rebuttal and I am satisfied with the response. The authors should be sure to clarify the constraints in the CCA optimisation problem.
Q2: Please summarize your review in 1-2 sentences
I think overall the work is extremely interesting and appears to work well empirically. As it stands I think the theoretical analysis is incomplete, although that does not detract too much from the impact and importance of the work.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper gives a new approach to nonlinear correlation. The approach consists of three steps: (1) copula transform, (2) random nonlinear projection, and (3) CCA. For each step, the authors give a theoretical guarantee, which makes the approach convincing. The authors demonstrate the usefulness of the approach using synthetic and real data sets.

This is a well-written paper with high novelty. As the authors describe, the RDC score is easy to compute. Their experiments, Fig. 4 in particular, clearly demonstrate the utility of the approach. I enjoyed reading the paper very much. Although calculating nonlinear correlation has not itself been a central topic in the machine learning community, I found it an interesting building block for more realistic tasks such as regression, as one of the experiments already suggests. I recommend accepting the paper.
Q2: Please summarize your review in 1-2 sentences
A new and interesting approach to measuring nonlinear correlation. The novelty is high.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Dear reviewers,

Thank you for the supportive feedback and useful comments; we are delighted to see this positive response to our submission.

"comparison to kCCA with Cholesky decomposition"

This is a fair point. We downloaded the kCCA implementation of Bach and Jordan. We used the default parameters (a moderate number of them exist to fine-tune the procedure) and the provided MEX/C compiled routines. The running times were much slower than for RDC: 0.40, 3.24 and 43.80 seconds for sample sizes of 1000, 10000 and 100000 respectively. Preliminary power measurements indicate worse accuracy and higher sensitivity to regularization than RDC. The implementation is also notably more involved (around 1000 lines of MATLAB). A detailed comparison will be added to the paper.

Conceptually, we see several advantages for random features:
1) For n samples of d dimensions and approximations of rank k, Fastfood expansions require less RAM (O(k) << O(kd)) and less CPU time (O(nk log d) << O(nk^2d)) than low-rank kernel matrix approximations such as Nystrom or incomplete Cholesky decomposition. This difference is crucial, for instance, when dealing with collections of images (large d) or with large k.
2) Random-feature expansions can outperform low-rank kernel matrix approximations (Le et al., 2013). This is because one might not want to focus on approximating the kernel in the first place (as in Nystrom), but rather to construct a good span of random basis functions for regression (as in random features).
This discussion will also be included in the paper.

"selection of k, number of random features"

This is the most interesting and challenging problem, and it translates to the classical dilemma of (unsupervisedly!) choosing a regularizer or an approximation rank: the more random features we allow, the more flexible the function class becomes. However, we experienced rather satisfactory robustness with respect to the number of random features: when iterating k from 5 to 500, we usually observed variations in the RDC score of less than one percent. This is not the case for the choice of the random basis function family (i.e., the kernel function in the kernel machine analogy), which seems of much greater importance. Thanks for pointing this out; a better illustration of this matter will be included in the paper.
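For illustration, such a robustness check can be scripted in a few lines (the rdc function below stands for any implementation of the coefficient, e.g. the sketch shown earlier in this discussion, rather than our exact code):

```python
import numpy as np

x = np.random.randn(1000)
y = np.sin(3 * x) + 0.1 * np.random.randn(1000)   # non-linear dependence

# rdc: an illustrative implementation of the coefficient, as sketched above.
for k in (5, 10, 50, 100, 500):
    print(k, rdc(x, y, k=k))   # the scores should vary only slightly with k
```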

"comparison to Nystrom"

Many thanks to the reviewer for pointing out (Yang et al., 2012). We will try out the Nystrom idea.

"the norm of m in the theoretical analysis"

Given the constraints of the CCA optimization, the L2-norms of alpha and beta are both constrained to 1, which effectively controls the Frobenius norm of m. The influence of k is incorporated into the bound analogously to Rahimi and Recht (2008). Please note that the bound measures the distance between RDC and HGR when the latter is restricted to the function class F. This will be clarified in the paper.
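For clarity, one way to write the optimization is the following (generic notation: Phi_x and Phi_y denote the random non-linear feature matrices of the copula-transformed variables; the exact normalisation will be spelled out in the revised paper):

```latex
\mathrm{RDC}(X, Y) \;=\; \max_{\alpha, \beta}\;
\operatorname{corr}\!\bigl(\Phi_x \alpha,\; \Phi_y \beta\bigr)
\qquad \text{s.t.} \qquad \|\alpha\|_2 = \|\beta\|_2 = 1 .
```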

"the proposed method is a concatenation of previously known techniques"

It is true that each of the steps forming RDC was previously known. But we do not think that, had someone combined them in this way for dependence analysis within a paper on another topic, it would have been considered an "obvious" thing to do that needed no further analysis. Our paper motivates each individual step, analyses it theoretically, offers an empirical evaluation and demonstrates the ease of implementation.

"random fluctuations"

As shown in Section 5, random fluctuations had little impact on RDC performance. Random variation is a valid theoretical concern, but it also applies to other approximations needed in the large-scale setting, such as low-rank approximations to kernel matrices. This motivates the interesting idea of building distributions over RDC scores on a given dataset to yield a more powerful analysis of their value and confidence (as in the bootstrap).