Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This is a well-written paper that extends current work
on parameter estimation in "superposition-structured" problems. This
includes sparse + group sparse structure, sparse + low rank, etc. The
paper establishes error bounds for regularized M-estimators where the
loss function is convex and the penalty is a convex combination of
regularizers, each promoting a specific type of structure on the model
parameter.
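(For concreteness, the estimators in question take the generic form

\hat{\theta} \in \arg\min_{\{\theta_\alpha\}} \; \mathcal{L}\big(\sum_\alpha \theta_\alpha\big) + \sum_\alpha \lambda_\alpha \mathcal{R}_\alpha(\theta_\alpha),

where each \mathcal{R}_\alpha is the regularizer promoting one structure, e.g. the
\ell_1 norm for sparsity or the nuclear norm for low rank, and \lambda_\alpha is its
weight; the notation here is schematic rather than copied from the paper.)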
The most
relevant past work to the results of this paper seems to be Agarwal et al.
'12; the paper compares the statements of Theorem 1 and Corollary 4 with
corresponding results in that paper, and claims that the bounds are
tighter (converging to 0 as the number of samples n increases) and the
imposed conditions are weaker. From the point of view of a non-expert, it
would be nice to explain these differences more carefully. For instance,
Agarwal et al. prove matching lower bounds, so the improvements must
arise from the difference in the class of models considered.
Otherwise, the theoretical conclusions of the paper seem sound and the
proofs are clearly written (I only read through a subset of them). Other
comments:
(1) You might want to mention that condition (C3) is
only required to hold locally around \theta^*, and not globally over all
pairs (\theta_1, \theta_2).
(2) It's a bit confusing to figure out
what exactly \Omega_\alpha refers to. In (C3) and (C4), this is called a
"restricted set." Shouldn't the statement of Theorem 1 have some
dependence on what restricted sets \Omega_\alpha are involved in (C3) and
(C4)? For the sake of the corollaries in the paper, it seems \Omega_\alpha
= \Theta, since all models considered are actually linear with an \ell_2
loss.
(3) Following the previous point, it would be interesting to
say something about cases when (C3) and (C4) might not hold over the
entire space \Theta, and one would need to work a bit harder to construct
the appropriate analog of Theorem 1 over restricted sets. It would also be
interesting to have some concrete examples with more general loss
functions.
(4) The use of "local"/"global" terminology to refer to
RSC conditions (lines 421 and 555) is confusing. Here, the authors seem to
use "local" RSC to refer to deviations in one component only, whereas
"global" refers to arbitrary deviations from \theta^*. However,
local/global RSC would more naturally be used to describe RSC conditions
which hold over a restricted radius vs. the entire space.
(5)
Although the contribution of this paper is entirely theoretical, it would
be nice to have some more motivating examples for why such a general,
unified framework might be useful in practice (and what the impact of this
paper is beyond the special cases already considered by previous
authors).
Q2: Please summarize your review in 1-2 sentences
This is a solid theoretical paper that is clear and
well-written. It could be improved in its comparison to past work and in
its arguments for the relevance of the theoretical contribution.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
This paper presents a unified framework for the
high-dimensional analysis of dirty statistical models. The theoretical
analysis is sound and covers many statistical models as special cases. It
provides a guideline for analyzing hybrid regularized learning problems. The
paper analyzes the equivalence between the “dirty”
(superposition-structured) regularization and the “clean” regularization.
Furthermore, the paper extends the analysis of M-estimators with a single
regularizer in [11] to the setting of multiple regularizers, where each
regularizer captures a specific structure; the theoretical analysis in
this paper is thus more general. The paper is well organized and well written.
My only concern is whether the unified theoretical
framework is consistent with the analyses of specific settings. The dirty
multi-task feature learning model [8] comes with a very strong theoretical
analysis. How does this paper compare with [8] when the
loss function and regularizers take those specific forms?
Q2: Please summarize your review in 1-2
sentences
This paper provides a unified framework for dirty
statistical models. The theoretical analysis is solid.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
This paper provides bounds for the learning of
M-estimators minimizing a loss function with superposition-structured
regularizers. This type of statistical model features a parameter vector
which is a superposition of several vectors, each of them having a
specific structure enforced by its own regularizer. Examples of such
models have been the subject of very recent results (see in particular the
referenced papers from Chandrasekaran et al.). The present work provides
bounds for the learning of the parameter vector by extending the framework
of decomposable regularizers proposed in (Negahban et al. 2012). While the
general result is provided in Theorem 1, the resulting bound involves
several unknown constants which are specific to the statistical model. The
remainder of the paper studies several examples of such models and derives
explicit bounds for each of them. Similarly to (Negahban et al. 2012), the
resulting bounds emphasize the good behavior of structured regularizers
for high-dimensional problems when the target vector lies in an
appropriate structured subspace.
Pros: The type of statistical models ("dirty" models) for which bounds are
derived has recently triggered considerable interest from the statistical
community, elegantly providing convex relaxations to challenging problems.
These problems moreover touch areas directly relevant to machine learning
(graphical models, regression, ...) and have clear potential applications.
The paper provides important insights into the factors determining the
performance of the superposition-structured M-estimator. In particular,
the concept of Structural Incoherence is introduced (assumption C4) to
account for the interactions between the regularizers associated with each
structure. As stated in the paper, the provided bounds seem to represent a
considerable improvement over the state of the art and are more general.
Finally, the authors convincingly illustrate the application of their
general methodology in three interesting examples of sparse statistical
models.
Cons: Although the authors claim their
result is very general, the assumptions on the model seem in practice
quite constraining: in particular, the family of appropriate penalties,
namely those that are both norms and decomposable (as a sum of penalties
over orthogonal subspaces), seems restricted to immediate generalizations
of the \ell_1-norm. It would be useful to discuss the
possibility/consequences of including additional regularizers such as the
squared Euclidean norm (typically: can this framework deal with elastic
net penalties?). One other issue is the applicability of the bounds to
practical cases. As far as I understand, the assumptions required to prove
these bounds are numerous and involve several unknown parameters. In
particular, in the first two presented examples (Proposition 2,
Corollary 2, Proposition 3), the conditions under which the assumptions
hold are not established; notably, all results assume the C-linearity
condition (7) without further justification. Corollary 4 assumes a
(simpler) C-linearity condition (12) as well. Moreover, in all examples,
the bounds depend on \bar{\kappa} or directly on the curvature parameter
\kappa_{\mathcal{L}}, for which no range of acceptable values is provided.
Is it possible to show that these assumptions hold with some degree of
generality and for some range of the parameters (curvature, ...), which
might depend on the specific model? Although very well written, the paper
is also very dense and some definitions/notations are difficult to
understand without referring to (Negahban et al. 2012); see the following
comments for details.
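(For reference, as a standard definition and not something taken from the
paper: the elastic net penalty combines the two terms

\lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2;

since the squared \ell_2 term is not a norm, it does not directly fit the
decomposable-norm family discussed above.)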
Comments:
-line 142: Proposition 1 is important; please provide at least references
for the concept of infimal convolution (a standard definition is recalled
after this list). Although the result seems intuitive, please make sure all
necessary assumptions are correctly listed.
-line 157: please mention that the target parameter is the minimizer of the
population risk.
-line 168: the concept of a restricted set is introduced; however, without
referring to other readings, it is difficult to understand how these sets
are defined for each structure and how they are used. Please provide some
explanation.
-Corollaries 2, 3 and 4 do not mention whether the bound is obtained with
high probability; please clarify.
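(The standard definition, recalled here for reference rather than quoted
from the paper: the infimal convolution of two convex functions f and g is

(f \,\Box\, g)(x) = \inf_{y} \{ f(y) + g(x - y) \},

which is the natural way a penalty on a sum \theta = \theta_1 + \theta_2
collapses to a single "clean" penalty on \theta.)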
Typos:
-line 230: "subspaces are chosen so that $\prod$...$"; I guess the
projector on the subspace *orthogonal* to $\mathcal{M}$ should vanish.
-line 303: "we are now ready *to* derive"
Q2: Please summarize your review in 1-2
sentences
Good theoretical paper, providing insights into the
performance of recently proposed sparse statistical models with direct
relevance to machine learning. However, the range of models for which the
numerous required assumptions are valid is not clear.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their careful and
encouraging comments and feedback.
Reviewer 4:
Our
comparisons with Agarwal et al. were terse for lack of space; we will
expand upon these in the final version. As we noted in the paper,
their bounds did not go to zero as a function of n; this was in part
because they did not impose any counterpart of our novel structural
incoherence condition (C4) on the interaction between the different
structured components, and instead imposed \ell_\infty bounds on the
parameters.
On \Omega_\alpha: As the reviewer astutely notes, we set
\Omega_\alpha = \Theta in all our corollaries; in fact, we intended to
specify in Theorem 1 that (C3) and (C4) be satisfied with \Omega_\alpha =
\Theta (alternatively, we could remove references to \Omega_\alpha
entirely). This is because the conditions (C3) and (C4) have 'tolerance'
terms, and thus do not need a restricted set of directions per se. (In
other words, in (C3), instead of requiring a quadratic lower bound over a
restricted set of directions, we can require a quadratic lower bound over
all directions minus a tolerance term; see the sketch after this
paragraph.) See the discussion of the restricted strong convexity
condition in ref [12], which also uses just a tolerance term, instead of
the restricted-directions version in the conference version, Negahban et
al. (2009). (In any case, the restricted subset of \Theta over which we
actually need the conditions (C3) and (C4) to be satisfied is specified by
the set in eqn 16 in the supplementary materials; but restricting to this
set instead of all of \Theta does not lead to better bounds given the
choice of the tolerance term, so we could set \Omega = \Theta.)
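(A minimal sketch of such a tolerance-based condition, in generic notation
following ref [12] rather than the paper's exact statement of (C3):

\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla\mathcal{L}(\theta^*), \Delta \rangle \ge \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}(\theta^*) for all \Delta,

where the tolerance term \tau_{\mathcal{L}}(\theta^*) plays the role
otherwise served by restricting the set of directions \Delta.)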
On more examples: we were constrained
by lack of space, but we will add more examples to the supplementary
materials (e.g. low-rank + row-sparse + column-sparse structure, and so
on).
Reviewer 5:
On comparison to ref [8]: that work
provided a different class of guarantees, analyzing the
sparsistency (recovery of the joint sparsity pattern) of their dirty
multitask estimator (specifically, sparse + block-sparse multiple linear
regression), for which they needed additional, stronger irrepresentable
conditions. In this paper, we restrict ourselves to providing parameter
error bounds, for which we impose only much milder conditions (restricted
strong convexity and structural incoherence).
Reviewer 6:
The class of decomposable regularizers is fairly broad, see
Negahban et al. 2012; but we agree with the reviewer that extending the
analysis beyond decomposable regularizers is an interesting and
important question for further research.
For \kappa_{\mathcal{L}}
(and therefore \bar{\kappa}), we can directly appeal to previous
results for individual structures. For all the structures considered in
our examples, \kappa_{\mathcal{L}} is (a small) constant; when the design
X is drawn from a multivariate Gaussian, and with the linear regression
loss, it depends on a restricted minimum eigenvalue of the population
covariance matrix.
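(To make this concrete with a standard special case rather than a new
claim: with the squared loss \mathcal{L}(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2
and rows of X drawn i.i.d. from N(0, \Sigma), \kappa_{\mathcal{L}} can be
taken proportional to a restricted minimum eigenvalue of \Sigma, up to
constants and with high probability once n is sufficiently large.)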
Thank you for catching the typos; we will fix
them.