
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper give a rather original approach to characterizing differences between two distributions (e.g. populations) by linear projections. Experiments on synthetic and real data seem to support the conclusions. Looks like a very solid and original paper.
Unfortunately I have low confidence for this review mainly due to my ignorance of relevant literature.
Q2: Please summarize your review in 12 sentences
Interesting and algorithmically sophisticated paper on a method trying to find differences between populations, going beyond mere classification.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors proposed a new dimensionality reduction method that finds the most different direction between input X and Y. The novelty of the proposed method is to use the squared Wasserstein distance as discrepancy measure and can be solved by semidefinite programming. Through experiments, authors showed that the proposed method compares favorably with existing methods.
Quality: The method technically sounds.
Clarity: The paper is well written and easy to follow.
Originality: The approach is new. The problem in this paper is similar to the one in Transfer component analysis, which finds a subspace that have small discrepancy between two datasets.
Significance:
The formulation with Wasserstein distance + SDP is interesting. Thus, it would have some impact in ML community.
Detailed comments: 1. The problem can be easily formulated by using simple Lasso. For example, if we add positive pseudo labels for X and negative pseudo labels for Y and solve the problem Y  Z^t \beta_2^2 + \lamda beta_1, you may be able to obtain similar results. Actually, this approach only useful if X and Y are linearly related, thus it can be a good baseline.
2. Is it possible to extend the algorithm to nonlinear case? 3. For the problem, transfer component analysis can be used to find a most different direction. (Although TCA was originally proposed for finding common subspace, it can be easily applied for your task). http://www.cse.ust.hk/~qyang/Docs/2009/TCA.pdf
Q2: Please summarize your review in 12 sentences
The proposed formulation is interesting. If author add can add simple Lasso based baseline, it would be a plus.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The proposed approach is motivated by the "CramerWold device", which ensures the existence of a linear projection that differentiates two distributions. The authors apply the Wasserstein metric directly on samples from both distributions, and show favorable theoretical properties of such an approach under reasonable assumptions (such as bounded domain variables). The analysis is extended to distributions that differ only in a few dimensions, by considering sparse projections. The resulting computational problem turns out to be nonconvex. The authors propose a semidefinite program relaxation and further preprocessing to estimate a solution. Empirical evaluation is provided on simulated and real datasets, including an application to bioinformatics.
The paper is well written and clearly lays out the main ideas. The paper dresses a problem of interest to the community using a novel technique and sufficient theoretical and empirical validation.
Minor issues:  I would encourage the authors to include more description and legends in each figure so they are more self contained and simplify interpretation.  the term "CramerWold device" seems to be fairly nonstandard outside of the reference [10]. Would the authors prefer the more standard "CramerWold theorem"?
Suggestions for future work: The presented theoretical results are concerned with the sample complexity as compared to the optimal linear projection. It would also be useful to characterize the relationship between the induced divergences and more standard divergences computed directly between the high dimensional distributions, as this would help clarify the convergence in distribution implied by the Cramer Wold Theorem.
Q2: Please summarize your review in 12 sentences
The authors propose a technique to differentiate between distributions, based on the maximal univariate Wasserstein divergence between projections of the underlying random variables. The paper dresses a problem of interest to the community using a novel technique and sufficient theoretical and empirical validation.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes to find sparse projection bases that maximise the Earth Mover's Distance (EMD) between distributions. To the
best of my knowledge component analysis using EMD has received limited attention (except [A] for NMF). A nonconvex optimisation problem is formulated and solved via a kind of alternating optimization. Some experiments (which importance I cannot fully evaluate) show quite some improvement over the stateoftheart.
[A] Sandler, Roman, and Michael Lindenbaum. "Nonnegative matrix factorization with earth mover's distance metric for image analysis." Pattern Analysis and Machine Intelligence, IEEE Transactions on 33.8 (2011): 15901602.
After reading the rebuttal I believe the paper is of NIPS quality and recommend acceptance.
Q2: Please summarize your review in 12 sentences
The paper proposes to find sparse projection bases that maximise the Earth Mover's Distance (EMD) between distributions. To the
best of my knowledge component analysis using EMD has received limited attention. An alternating optimisation procedure is proposed to find the bases. Some experiments (which importance I cannot fully evaluate) show quite some improvement over the stateoftheart. I recommend acceptance.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their comments and
interest. In our response, we abbreviate reviewer names (eg. R1 =
Assigned_Reviewer_1).
R2 proposes a baseline method to compare
with. Our interpretation of the comment is that in the expression Y 
Z^t beta_2, R2 uses Z to denote the featurevector and Y a 01 label, so
this proposal corresponds to standard leastsquares regression (with
lasso). Generally, logistic (lasso) regression is preferable for binary
responses [1]. As we already evaluated our approach against the latter
method (Figure 1b), the proposed comparison seems unnecessary given the
space constraints.
R2 suggests extending our method to nonlinear
projections. Due to the CramerWold device, our linear PDA can already
capture general effects as the marginal distribution after projection
remains unconstrained. While nonlinear extensions are certainly
promising, we believe this would entail a separate method (better
introduced in another paper) that leverages ideas from RKHS/kernel theory,
principal curves, or manifold learning. Here, the primary challenge would
be retaining interpretability.
R2 suggests comparing with a
maxratherthanmin variant of Transfer Component Analysis [2]. This is
certainly a related method, but the nonlinearity of the feature mapping
hinders interpretability (and if feature maps are not sufficiently
complex/nonlinear, i.e. corresponding to a "characteristic" kernel, then
MMD  the divergence optimized in TCA  is not a proper distance between
distributions because the mean map is no longer injective [3]). Without
additional modification to enhance interpretability, this TCAvariant is
better suited for twosample testing. We have already compared our
approach against the closely related MMD statistic (intended for
highdimensional problems) [3] and many other popular twosample tests.
Furthermore as R2 states, the actual TCA method is not relevant and would
require modification.
R6 suggests comparing with the MMD twosample
test from A. Gretton [3]. We did this in Figure 1c (green curve), but
there may be confusion because we poorly refer to this method as "Mean Map
Discrepancy" (referencing an earlier name) rather than "Maximum Mean
Discrepancy". We will change to the proper latter name. Furthermore, our
methods have additional uses beyond twosample testing / measuring
distance between distributions, unlike MMD (the weights in our projection
have interpretations which provide additional characterization of the
interpopulation differences beyond a single pvalue or computed
distance).
R3 brings up a good point that Earth Mover distances
have received limited attention in component analysis. We shall mention
this as background and include the reference provided by the reviewer,
along with highlighting other works that successfully employ Wasserstein
metrics surveyed in [4].
This should also address R6's concern
why Wasserstein divergence is of particular interest. We can also further
motivate this metric intuitively. For example, the Wasserstein distance
provides a dissimilarity measure between populations that integrates both
the fraction of individuals which are different together with the
magnitude of the differences.
There is an interesting
interpretation of Wasserstein PDA (which we omitted due to space, but
could include, e.g. in supplement): Consider a twoclass problem with
paired samples (x_i, y_i) where X, Y follow two multivariate Gaussian
distributions. The first principal component of the paireddifferences X 
Y (in PCA without centering) equals the direction which maximizes the
projected Wasserstein distance between the distribution of X  Y and a
delta distribution at the origin. Thus (uncentered) PCA can be viewed as a
special case of PDA in the Gaussian, pairedsampledifferences case.
R4 provides a nice CramerWold reference that we will include in
our final draft.
R5 suggests more detailed Figure 1 legend/captions
which we will address.
R5 provides ideas for future theoretical
investigation, particularly the desire for connections to morestandard
divergences computed between the original highdimensional distributions.
We agree this is a good idea.
R5 views term "Cramer Wold Theorem"
as more standard than "Cramer Wold device". We believe the latter is
actually more common (cf. [5], a standard text) but are happy to use
whichever is more understandable.
[1] Brannick MT. Logistic
Regression. http://faculty.cas.usf.edu/mbrannick/regression/Logistic.html
[2]
Pan SJ, Tsang IW, Kwok JT, Yang Q (2009) Domain Adaptation via Transfer
Component Analysis. IJCAI.
[3] Gretton A, Borgwardt KM, Rasch MJ,
Scholkopf B, Smola A (2012) A Kernel TwoSample Test. JMLR 13:
723773.
[4] Levina E, Bickel P (2001) The Earth Mover's distance
is the Mallows distance: some insights from statistics. ICCV 2:
251256.
[5] Durrett, R. (2010) Probability Theory and Examples.
4th edition: pp 176. 
