
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper addresses problem of training twolayer
network for classification with latent variables. The authors propose a
convex SDP relaxation of originally nonconvex training problem. They also
provide an approximate optimization algorithm to solve their SDP
formulation. A proof of concept experiments show promising results,
namely, that the algorithm outperforms both globally optimized
singlelayer models as well as the same twolayer model optimized with
local alternating minimization.
The proposed reformulation which
allows SDP relaxation is interesting and novel. Overall the paper is
sufficiently clear though some parts of the text are dense. The paper
seems to be technically sound.
The main weakness seems to be
complexity of the resulting SDP problem (21). The authors could
mention basic properties of the proposed optimization algorithm for
solving (21), e.g computational time required for the benchmark problems
and whether the algorithm provides a precise solution (i.e. what was
the stopping condition used). This is an important information because
convex problem does not immediately mean easy to solve problem, i.e. a
convex relaxation can be intractable in practice and it should be
clear if it is the case or not. However, the proposed relaxation would be
valuable even in this case.
Minor comments:  equ (17):
variable is missing below the first minimum  equ (18): I think
N=\Phi'*\Phi should appear in the conditions defining the set.  line
331: (d) Synhetic results  line 356: It is unclear why the used
transductive evaluation does not require computing responses of the
network and thus knowledge of W and V.  I could not find the actual
size of instances of the problem (21) solved in the experiments.
Q2: Please summarize your review in 12
sentences
The proposed SDP relaxation is an interesting attempt
to find a better approximation of an important instance of nonconvex
training problems. Though the current algorithm may not be practical
it can inspire development of more efficient
methods. Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a convex approach to train a
twolayer model in supervised learning. This is achieved by incorporating
large margin losses in the training objectives, adopting indmax transfer
for the second layer and multilabel perception model with a step transfer
for the first layer; finally, convex relaxations are applied that make
global training of a twolayer model possible.
The paper is well
written, and the process of linking all the factors together and deriving
a convex objective is interesting and insightful. The authors did a good
job presenting the technical steps taken to the ultimate convex
formulation, with some nice intermediate results. The global training
methodology proposed in this paper is meaningful from a deep learning
perspective, and experimental results are convincing to demonstrate the
significance of global training.
Q2: Please
summarize your review in 12 sentences
The paper presents a convex approach to train a
conditional twolayer model. The technical work in this paper is solid,
while the conclusion is significant and insightful. Submitted
by Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
A convex relaxation of twolayer neural network is
proposed in the paper. This paper is wellwritten. The experiments
show good performance on the "real" datasets. But my main concern is
the scalability of this approach.
The approach of using SDP
for convex relaxation was widely used more than 10 years ago, nothing
new here. Though it has a nice form, the scalability is the major
issue of this type of relaxation. Here, we need to optimize a problem
of t^2, the square of the size of instances, which is probably only
feasible for toy datasets. For the scalability issue, it would be better
to compare the training time among algorithms.
Algorithm 2 takes
advantage of low rank of N. However, the rank of N is not guaranteed
to be small.
For those synthetic experiments, RBF SVM probably can
achieve a quite good. A fair comparison would be Nystrom approximation
to RBF SVM with random selected bases, instead of onelayer linear
SVM.
Q2: Please summarize your review in 12
sentences
Wellwritten. But SDP convex relaxation is not novel.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
Thanks to the reviewers for their comments.
Assigned_Reviewer_5:
As requested, the average
training times (in seconds) of the various twolayer methods are as
follows:
Experiment 1: train/test = 100/100, average training
time: MNIST: CVX2 = 38s (21), LOC2 = 4s, TJB2 = 18s USPS: CVX2 =
43s (23), LOC2 = 14s, TJB2 = 8s Letter: CVX2 = 32s (26), LOC2 = 191s,
TJB2 = 3054s (see Note) CIFAR: CVX2 = 10s (21), LOC2 = 6s, TJB2 = 15s
G241N: CVX2 = 14s (24), LOC2 = 11s, TJB2 = 4s (Average rank of the
CVX2 solutions are shown in brackets.)
Experiment 2: train/test =
200/200, average training time: MNIST: CVX2 = 158s (22), LOC2 = 85s,
TJB2 = 162s USPS: CVX2 = 224s (25), LOC2 = 201s, TJB2 = 3790s (see
Note) Letter: CVX2 = 101s (24), LOC2 = 180s, TJB2 = 66s CIFAR:
CVX2 = 94s (24), LOC2 = 51s, TJB2 = 118s G241N: CVX2 = 260s (34), LOC2
= 21s, TJB2 = 24s (Average rank of the CVX2 solutions are shown in
brackets.)
Recall that in the implementation of the proposed
convex method, CVX2, each boosting step (which adds a single rank to the
solution) is interleaved with local optimization. For the local
optimization we use a standard LBFGS implementation with default
termination conditions. For the outer boosting iterations we terminate
when the relative objective improvement is less than 5e5 or the absolute
improvement is less than 1e3. The average rank results in the above table
corresponds to the number of boosting rounds used by CVX2, which also
determines the rank of its final solutions. From these results, one can
see that the method uses significantly less than the full O(t^2) storage.
Note: We obtained the implementation of TJB2 from [18].
Unfortunately, that implementation experiences convergence issues on
certain problems and settings, and we were not able to rectify the problem
in the obtained code.
Addressing the minor comments:

Thanks  Yes  Thanks  Yes, the transductive evaluation needs
to use V and Phi. More precisely, it uses A and N, since V Phi = A N as
given in Lemma 1.  The experiments involve instances of (21) of two
sizes: t = 200 (in the 100/100 train/test split cases) and t = 400 (in the
200/200 train/test split cases).
Assigned_Reviewer_6:
Thanks
Assigned_Reviewer_7:
As the paper
explains in Section 5, the boosting algorithm developed for CVX2 is
explicitly designed to avoid optimizing over variables of t^2 size. The
experimental results (given above in the response to Assigned_Reviewer_5)
demonstrate that the ranks of the final solutions returned by CVX2 remain
very small compared to the size of the training data. In practice, the
boosting implementation of CVX2 tends to use O(t) rather than Omega(t^2)
space because it terminates with small rank. The experimental results also
show that the training times of the method are not significantly worse
than the other twolayer methods on these problems.
Although
semidefinite relaxation has been applied to other machine learning
problems in the past, no such formulation has previously considered
training a nontrivial latent layer while simultaneously training both the
first and second nonlinear layers in a feedforward architecture. New
results were required to address the key issues overcome in Sections 3 and
5. The closest precursors that use semidefinite relaxations are supervised
latent clustering methods (which were already well reviewed in the paper),
but the formulation presented in the paper goes beyond supervised latent
clustering.
The experiments were explicitly designed to control
the input representations used between the competing learning algorithms.
Obviously feature/kernel engineering is important for any practical
problem, but the key point of this paper is to demonstrate that
nonconvexity is not a necessary aspect of successful deep training,
despite the prevailing folklore to the contrary. Note that it is possible
to use RBF kernels with all the methods presented in this paper, including
CVX2. Therefore, it is justified to compare alternative formulations by
fixing the input representation to a common shared choice. Changing the
input kernel for some methods and not others would only demonstrate the
effects of using a poor kernel, but that is an orthogonal issue to the
main point of this paper.
 