Paper ID: 289
Title: Correlated random features for fast semi-supervised learning
Reviews

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents the idea of using correlated random features for efficient learning in the semi-supervised setting. It draws on several areas - Nystrom approximation, CCA and random Fourier features. Overall, the authors have done a nice job of combining these ideas to propose Nystrom-based Correlated Views, and of conducting detailed experiments. The datasets considered are quite comprehensive and the results look good.

It would have been nicer if a synthetic experiment had been conducted to illustrate the claims/arguments given in the first paragraph of Section 2.3, in particular the effect of weakly correlated features.

The paper is reasonably well written. In some places I had difficulty understanding it - for example, is the squared error loss also used during training for the classification problems? The other comment is that the statistical comparison of multiple algorithms on multiple datasets is a well-studied problem (see, for example, the paper by Janez Demsar, JMLR 7 (2006), pp. 1-30). It would have been nicer if such a comparison had been made; this would help make the performance claims stronger in terms of statistical significance.


There are some typos in the paper (e.g., the third-to-last sentence of the conclusion).
Q2: Please summarize your review in 1-2 sentences
The paper presents the idea of using correlated random features for efficient learning in the semi-supervised setting. Overall, this is a decent paper combining ideas from several areas - namely, Nystrom approximation, canonical correlation analysis and random Fourier features. Nevertheless, novelty is somewhat limited for the same reason: the ideas and some of the results are borrowed from these areas.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
UPDATE: I acknowledge that I have read the author rebuttal.

The authors present a technique for semi-supervised learning based on the following idea: first, they use unlabeled data to learn useful features, using Nystrom featurization together with canonical correlation analysis (the idea being to featurize with respect to multiple subsets of the data, and then use CCA to find features that are correlated across the subsets). Once this step is performed, the labeled data are used to train a model (regularized based on the correlation coefficients found by CCA).
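To fix ideas, here is a minimal sketch of that two-step pipeline as this reviewer understands it, using scikit-learn's Nystroem and CCA. The synthetic data, landmark counts, and the (1 - rho)/rho penalty weights are illustrative assumptions, not the paper's exact Algorithm 1.

import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)

# Synthetic setup: many unlabeled points, few labeled ones (hypothetical sizes).
n_unlab, n_lab, d = 2000, 50, 10
X_unlab = rng.randn(n_unlab, d)
X_lab = rng.randn(n_lab, d)
y_lab = np.sin(X_lab[:, 0]) + 0.1 * rng.randn(n_lab)

# Step 1a: two Nystroem feature maps built from disjoint random subsets of the
# unlabeled data, giving two "views" of every point.
k = 100  # landmarks per view
idx = rng.permutation(n_unlab)
nys1 = Nystroem(gamma=0.5, n_components=k, random_state=1).fit(X_unlab[idx[:k]])
nys2 = Nystroem(gamma=0.5, n_components=k, random_state=2).fit(X_unlab[idx[k:2 * k]])

# Step 1b: CCA on the unlabeled data finds directions in view 1 that are
# maximally correlated with directions in view 2.
Z1, Z2 = nys1.transform(X_unlab), nys2.transform(X_unlab)
n_comp = 50
cca = CCA(n_components=n_comp).fit(Z1, Z2)
S1, S2 = cca.transform(Z1, Z2)
# Empirical canonical correlation of each component pair.
rho = np.array([np.corrcoef(S1[:, j], S2[:, j])[0, 1] for j in range(n_comp)])
rho = np.clip(rho, 1e-3, 1 - 1e-6)  # keep the penalty weights finite

# Step 2: ridge regression on the labeled data in the CCA basis of view 1,
# penalizing weakly correlated directions more via the weights (1 - rho)/rho
# (the canonical-norm weighting this reviewer takes from Kakade and Foster 2007).
Z_lab = cca.transform(nys1.transform(X_lab))
gamma_ridge = 1e-2
D = np.diag((1.0 - rho) / rho + gamma_ridge)
beta = np.linalg.solve(Z_lab.T @ Z_lab + D, Z_lab.T @ y_lab)

# New points go through the same feature pipeline before prediction.
X_test = rng.randn(5, d)
y_pred = cca.transform(nys1.transform(X_test)) @ beta
print(y_pred)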

Quality/clarity: the paper is confusing in parts (particularly the description of canonical ridge regression) but well-written overall. The experiments are well-presented and solid.

Originality: most of the ideas are borrowed from elsewhere, but they are combined in a useful way and appear to be well-executed.

Significance: I think the combination of ideas, together with the fact that the experiments are good, makes this paper significant.

Other comments:
The authors may wish to include a reference to recent work on the method of moments, which is another multi-view approach that seems related, at least in this reviewer's naive intuition. (Feel free to ignore this comment if your intuition disagrees.)
Q2: Please summarize your review in 1-2 sentences
The paper combines together multiple interesting ideas in a technically competent way, and obtains good experimental results. I think this paper is quite strong overall.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
This paper brings together two recently popular lines of research, namely random features and multiview regression. It provides a two-step algorithm for semi-supervised learning. In the first stage, the authors generate random features corresponding to two views using the Nystrom method; in the second stage, they use CCA to bias the optimization towards features correlated across the views, via the canonical norm, which penalizes weakly correlated features more heavily. The experimental results show the superior performance of their approach over a state-of-the-art competitor, Laplacian Regularized Least Squares (SSSL).
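For reference, the second-stage objective as this reviewer reads it is the canonical ridge regression problem below; the (1 - \lambda_j)/\lambda_j weighting follows (Kakade and Foster 2007) and is a reconstruction rather than a quotation of the paper's equation (9):

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \langle \beta, z_i \rangle \right)^2 + \|\beta\|_{\mathrm{CCA}}^2 + \gamma \|\beta\|_2^2, \qquad \|\beta\|_{\mathrm{CCA}}^2 = \sum_{j} \frac{1 - \lambda_j}{\lambda_j} \beta_j^2,

where z_i collects the CCA-basis features of a labeled point and \lambda_j is the j-th canonical correlation, so directions that are weakly correlated across the two views incur a large penalty.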



Comments:
The paper is well motivated and addresses an important problem. It operationalizes the ideas proposed in (Kakade and Foster 2007) and presents a Nystrom method to generate two views that satisfy the multiview assumption (also detailed in (Kakade and Foster 2007)).

In short, the paper harnesses the CCA results from (Kakade and Foster 2007) and the Nystrom method results from (Bach 2013) to come up with an efficient and scalable semi-supervised learning algorithm.

That said, I found the paper to have limited mathematical novelty for a venue like NIPS; it is mostly an engineering paper.

The authors could have considered joint learning of the random features and the CCA bases, which would have been more novel and could possibly have led to better error bounds than applying the two methods separately.

The experimental results are detailed and the proposed approach (XNV) significantly beats SSSL.

The authors should consider adding an additional tuning parameter (\lambda) for the canonical norm in Algorithm 1 (9); this could improve accuracy further, since that term appears to be on a different scale than the other two terms in the objective.
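Concretely, the suggested modification would read

\min_{\beta} \sum_{i} \left( y_i - \langle \beta, z_i \rangle \right)^2 + \lambda \|\beta\|_{\mathrm{CCA}}^2 + \gamma \|\beta\|_2^2,

with \lambda chosen by cross-validation alongside the ridge parameter \gamma.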

Q2: Please summarize your review in 1-2 sentences
Mostly an engineering paper that harnesses the results from (Kakade and Foster 2007) and (Bach 2013) to come up with an efficient and scalable semi-supervised learning algorithm.


**My evaluation of the paper remains the same even after reading the author rebuttal. I feel that the observation that random features can be used within the CRR framework is, by itself, not enough.
That said, this is a nice engineering paper better suited to a more applied venue.**
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their comments and suggestions and respond to their main points. We first address the question of novelty raised by reviewers 3 and 6.

Our work is based on the important observation that random features define multiple views that are automatically guaranteed to satisfy the multiview assumption on *any* dataset. We convert Canonical Ridge Regression (CRR, Kakade & Foster, 2007), a theoretically motivated algorithm with impressive performance guarantees, into a general-purpose tool that outperforms the current state of the art. The resulting algorithm, XNV, is a powerful, computationally efficient method for semi-supervised learning which we demonstrate can be widely applied.

CRR has had little impact to date due to its highly specialized multiview assumption, which rarely holds and is difficult to check in practice. We expect our contribution will change this situation.


Reviewer 3:
1. Synthetic experiments with weakly correlated features.
We feel that there is some confusion in terminology between the observed variables in the dataset and the features we construct. Since the features are constructed randomly, we do not actively manipulate the correlations between them, and hence do not report on the specific effect of weakly correlated features.

An interesting approach, deferred to future work, is sampling from distributions that are designed so that the resulting features are weakly correlated. This should accelerate the decay of correlation coefficients, and may significantly improve performance.

2. Squared loss
We used squared error loss for training. This will be clarified in the final version.

3. Comparing algorithms across data sets.
Following standard practice in the NIPS community, we reported prediction error with standard deviations, combined with plots. We appreciate that there are other ways to compare algorithms across datasets. While we do not report statistical significance, the averages paint a clear picture of XNV's improvement over the other algorithms on many datasets.

Reviewer 6:
1. Learning representations.
The question of feature learning is an interesting one. Much recent work uses, for example, deep neural networks, which have been shown to be empirically successful. Typically, however, such methods are computationally intensive and come with few theoretical guarantees about the properties of the learned representations.

We take a different tack. Random features are cheap to compute, come with strong guarantees, work well in practice, and are easily applied to big datasets!

2. Additional tuning parameter.
Our aim is to introduce a practical, easy-to-use algorithm. We therefore keep tunable parameters to a minimum: the kernel-width and the ridge penalty.

From the theoretical properties of CRR, we expect that the performance benefit of adding an additional parameter for the CCA norm would not outweigh the cost in additional tuning time. However, this cannot be ruled out a priori and will be investigated in the journal version of the paper.