Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
[EDIT] I read the authors' rebuttal, but it did not
change my opinion.
The authors address the question of how a
population of linear-nonlinear neurons could optimally encode a
high-dimensional (Gaussian) stimulus ensemble in a mean-squared error
sense. Their main finding is that the optimal linear weights are not
entirely orthogonal and tend to point in directions where the stimulus has
low variance; the derivative of the optimal nonlinearity is the cube root
of the marginal distribution in the direction of the corresponding linear
weight.
The results are novel and the derivations quite elegant.
Having analytical results for the 2d case provides some nice intuitions
and the explicit learning rule for the high-dimensional case could provide
a good starting point for further studies. The figures are well done and
illustrative.
Overall, I think the paper constitutes a true
contribution to the population coding literature. However, in its current
form it is very hard to digest. I had to read it multiple times and many
parts only became clear much later.
One of the main problems is
that some of the terms are either defined far away from where they are
used or simply never defined at all and their meaning has to be guessed
from the context or is mentioned only later. Examples are: \pi(s) (line
159), \Sigma (line 211.5), Var[ws] (line 241). In addition, the large
number of language mistakes in the text (use of articles, third-person
s, tense, etc.) makes reading harder. It would be useful to have the
manuscript proof-read by a native speaker.
In terms of content my
main concern is the use of Fisher information in combination with the
statement that the Cramer-Rao bound can be attained in general
asymptotically by an optimal decoder (line 387). While this is entirely
true for the large T limit, it is equally irrelevant: the brain has to
process information close to real time and simply doesn't operate in the
large T limit. See e.g. Berens et al. 2011, which the authors cite, for an
example where Fisher information in combination with optimal coding does
not make useful predictions with respect to mean-squared error.
The CR bound is attained by the maximum likelihood decoder in the
large N (neurons) limit for one-dimensional stimuli (e.g. Ecker et al.
2012), but it is unclear if it is useful in the situation the authors
consider, where the number of neurons is equal to the number of
dimensions. Ideally, numerical simulations of maximum likelihood decoding
should be carried out to verify that the CR bound can actually be
attained.
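A check of the kind suggested here can be sketched on a toy problem. The example below is a minimal illustration only, not the authors' model: it assumes a single neuron with a logistic nonlinearity h, additive Gaussian noise of known standard deviation, and T independent observations of one fixed stimulus value. All function names and parameter values are hypothetical. Under these assumptions the maximum likelihood estimate is h^{-1} applied to the mean response, and the Cramer-Rao bound is noise_sd^2 / (T * h'(s)^2).

```python
import numpy as np


def h(s):
    """Logistic nonlinearity (an assumed stand-in for the tuning function)."""
    return 1.0 / (1.0 + np.exp(-s))


def h_prime(s):
    """Derivative of the logistic function."""
    p = h(s)
    return p * (1.0 - p)


def h_inverse(r, eps=1e-9):
    """Inverse of h; responses are clipped into (0, 1) so the logit stays finite."""
    r = np.clip(r, eps, 1.0 - eps)
    return np.log(r / (1.0 - r))


def mse_over_crlb(s_true, noise_sd, n_obs, n_trials, rng):
    """Ratio of the ML decoder's empirical MSE to the Cramer-Rao bound."""
    # Each row holds n_obs noisy responses to the same stimulus.
    responses = h(s_true) + noise_sd * rng.standard_normal((n_trials, n_obs))
    # With Gaussian noise, the ML estimate inverts h at the mean response.
    s_hat = h_inverse(responses.mean(axis=1))
    mse = np.mean((s_hat - s_true) ** 2)
    # Fisher information of n_obs observations is n_obs * h'(s)^2 / noise_sd^2.
    crlb = noise_sd ** 2 / (n_obs * h_prime(s_true) ** 2)
    return mse / crlb


rng = np.random.default_rng(0)
for n_obs in (1, 10, 100):
    ratio = mse_over_crlb(s_true=0.5, noise_sd=0.05, n_obs=n_obs,
                          n_trials=20_000, rng=rng)
    print(f"T = {n_obs:3d}: MSE / CRLB = {ratio:.3f}")
```

A ratio close to 1 indicates that the bound is attained in this toy setting. Whether the same holds when the number of neurons equals the number of stimulus dimensions would have to be tested with the authors' full model.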
Even in the absence of direct verification I think that
possibly an intuitive argument could be made if the representation is
sufficiently overcomplete so that the large T limit can be replaced by a
large N limit (would the math still work in such a situation?). However,
given the available evidence regarding the danger of "blindly" using
Fisher information, the authors should be somewhat more careful in stating
their conclusions and ideally verify their claims directly.
Finally, it appears somewhat counter-intuitive that the linear
weights should point in the direction with little variance in the
stimulus. As far as I understand the reason is that the noise is added
after the nonlinearity, which means that if the variance in a certain
direction is high an additive error will be large in stimulus space. Since
neurons try to have somewhat orthogonal weights, if one neuron points in a
direction of large variance, few others will be informative about this
direction. Thus, it appears to be beneficial to have a number of neurons,
which all respond somewhat to that direction but aren't fully aligned. I
think the general insight about how the weights should be organized could
be discussed in more depth and possibly be illustrated
better.
Q2: Please summarize your review in 1-2
sentences
A novel approach to the question of optimal neural
tuning for high-dimensional stimuli. The validity of the use of Fisher
information and the Cramer-Rao bound should have been verified and the
paper needs substantial improvement in terms of language and
clarity.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
Summary
Finding the objective functions that
regions of the nervous system are optimized for is a central question in
neuroscience, providing a central computational principle behind neural
representation in a given region. One common objective is to maximize the
Shannon Information the neural response encodes about the input (infomax).
This is supported by some experimental evidence. Another is to minimize the
decoding error when the neural population is decoded for a particular
variable or variables. This also has some experimental
support. These two objectives are similar in some
circumstances, giving similar predictions; in other cases they differ
more.
Studies that derive optimal distributions of neural
population tuning by minimizing decoding error (L2-min) have mostly
considered 1-dimensional stimuli. In this paper the authors extend
this work substantially by developing analytical methods for finding the
optimal distributions of neural tuning for higher-dimensional stimuli.
Their methods apply under certain limited conditions, such as when the
number of neurons equals the number of stimulus dimensions (diffeomorphic). The
authors compare their results to the infomax solution (in most detail for
the 2D case) and find fairly similar results in some respects, but with
two key differences: the L2-min basis functions are more orthogonal
than the infomax ones, and L2-min has discrete solutions rather than
the continuum found for infomax. A consequence of these differences is
that L2-min representations encode more correlated signals.
On
the quality score, impact score and confidence score: I chose accept
(top 50%) as I thought it was a good and interesting paper. I chose 2
(although this is a rather coarse-grained measure; I would probably go for
1.7). I chose 3 (fairly confident), but if I could choose 2.5 I would. I
think I got the gist of the paper, and Sections 1-2 were clear. I got lost
in several places in Sections 3-6, but I think I followed the general
argument to a varying degree. Section 7 was clear.
Quality
Is the paper technically sound? As far as I can tell, the
paper is technically sound. The areas of the paper that I could
confidently follow held up in my opinion.
Are the claims well
supported? I would say that the claims are well supported as well as I
could follow them. A little more explanation as to why the more orthogonal
solution of the L2-min encodes correlated signals would be useful.
Is this a complete piece of work? For a NIPS paper I would say
yes.
Are the authors careful (and honest) about evaluating both
the strengths and weaknesses of the paper? Yes, the authors are very
careful and honest in evaluating the paper’s strengths and weaknesses. In
fact, I think they are a little too circumspect, and I would have liked
just a little more discussion of its strengths and possibilities. However,
I recognize there are space limits.
Clarity
Is the
paper clearly written? Where I understood the paper, it seems
moderately clearly written. Where I didn’t understand, I don’t know if it
was the authors’ lack of clarity or my lack of familiarity with what they
were doing.
The clarity and consistency of the notation could be
improved. The notation varies somewhat across the paper and terms are
sometimes not defined. For example, they don’t define E[ |s] and <> as
expectations. They use n for both the number of neurons and the number of
spikes. They use ’ for differentiation but also simply as a marker (as s’
in 2.4).
Is the paper well organized? Yes.
Does the
paper adequately inform the reader? I would have liked some areas
spelt out more. See above.
Originality
Are the
problems or approaches new? Yes, analytical methods for high
dimensional L2-min representation are new as far as I know.
Is it
clear how this work differs from previous contributions? Yes.
Is related work adequately referenced? Yes.
Significance
Are these results important? This
work could help form the basis of a more general research program that
examines what general objective functions particular neural systems use.
Thus I think the results are important.
Are other people likely to
use these ideas? Yes, I think they could be useful as the basis of
future more specific and empirical analyses, and also expanded into a more
general framework.
Does the paper address a difficult problem in a
better way than previous research? Yes, the analytical results are
more flexible and secure and can be used to speed up numerical methods.
Does it advance the state of the art? Yes
Does it
provide a unique theoretical approach? Yes.
Strengths
High-dimensional L2-min is novel. Novel analytical approach.
Comparison with Infomax. Clear differences in solutions given.
Weaknesses Restricted circumstances (diffeomorphic). A
little unclear in places (may be due to reviewer).
Detailed
review
Sections 1 and 2 are clear to me. Only minor points
070: As opposed to the encoding, the
070: function called
080: E[|s] is the expectation over r, but might be good to define
it.
098: Perhaps you are using a notation I am not familiar with,
but by analogy with the L2 norm, should it be the L_2^2 loss?
101:
In this equation I am more used to a ≥. Please define the meaning of the
curved ≥ in this context.
113: Define s’. Later ‘ is used for
derivative, please clear up this ambiguity.
In section 3 I get
somewhat confused. Please give more detail on how Holder’s inequality is
used to find the minimum?
163: …, the right side of Equation 10
is…
178: Use n for number of spikes in the appendix, need to
change one of these variables.
Section 4. Also found this somewhat
confusing.
189: double integral sign for a multi-dimensional
integral is confusing. I suggest placing … between them, or just
having one integral sign.
193: as 189.
Sections 5 and 6.
Found this somewhat confusing, but less than 3 and 4.
229: What
happens to |w_k| in the relationship between Equations 17 and 18? Confused
here.
241: “Minimizing the numerator Var[w_k^T s] makes…” The
numerator of which equation? I assume you are referring to Equation 17; if
so, the relationship between Var[w_k^T s] and the numerator needs to be made
more explicit.
321: Might want to briefly discuss the similarity
and differences between Equation 16 and Equation 28.
371:
Apologies if I am totally missing the point. I would imagine that a more
orthogonal basis would result in less correlated signals. Where am I
wrong? Please explain. A little more elaboration on this point in the text
would be useful.
Section 7. Generally clear.
Section A.
Clear.
Q2: Please summarize your review in 1-2 sentences
A novel analytical extension of optimization of Fisher
Information based neural coding objective functions to high dimensional
stimuli under certain restricted conditions. Appears solid and a good
inspiration for further studies, if perhaps a little unclear in
places.
Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
This paper presented a new analysis of optimal
population coding under the assumption of (a) gaussian signal statistics;
(b) linear-nonlinear functional model with additive white gaussian noise;
and (c) the so-called diffeomorphic mapping (i.e. mapping from input to
output is invertible and smooth). Their main result is the optimal
solution of linear-nonlinear model to minimize L2 reconstruction errors. I
think this is significant. The authors also review the relevant works, in
particular, the difference of their solution from the infomax solution.
Overall, I think the quality is above the threshold. And the
figures are particularly well constructed.
There are, however, a
few major drawbacks:
- Signal is assumed to be gaussian: it is
well known, especially in the NIPS community, that the signal statistics of
interest are usually non-gaussian.
- Lack of application or
theoretical prediction of the proposed model. Specifically, what do the
linear filters and nonlinearities look like when the model is optimized
for a correlated gaussian signal of 8-by-8 pixels, for example? The authors
should already have these results, considering section 5.3.
Minor
comments:
- There are many typos, including: [line 062] consider a
$n$ -> consider an $n$; [line 070] function call -> function called;
[line 072] the respond -> the response; [line 381] to minimized ->
to minimize.
- comparison to infomax: the authors refer to it as
ICA, but because the signal is assumed to be gaussian, this is misleading
and inaccurate. Isn't "whitening" appropriate here?
- equations 1
and 2: I don't think $T$ and $V$ are defined in the manuscript.
Q2: Please summarize your review in 1-2
sentences
I think this is a good paper.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
First of all, we would like to thank all the reviewers
for their efforts and insightful comments. In general, we agree with the
weaknesses of our paper pointed out by the reviewers. In this
rebuttal, we would like to comment on a few aspects that could have been
better emphasized in our original submission: (A) the prerequisite
condition for using the Cramer-Rao bound; (B) the assumption of a Gaussian signal;
(C) the lack of applications or theoretical predictions; (D) miscellaneous points.
(A) Condition to Use Cramer-Rao Lower Bound (CRLB)
The
CRLB is known to be sharp given a long encoding time (T->infty) and
a proper decoder. We agree with the reviewer who pointed out that such
a requirement may undercut the generality and applicability of our result.
There are indeed some equivalent conditions, including multiple copies of
neurons (N->infty).
In the large N case, copies of the optimal
neurons from the diffeomorphic case (# of neurons = # of dimensions) are still
optimal under one additional constraint -- each neuron must actively
encode all possible outcomes of its filtered input as h(w_k * s), i.e. the
non-linear function "h" is determined only by the marginal distribution
\pi_k(w_k * s), regardless of what the other neurons are doing. A counterexample
exists in the 1D case (see Fig. 6 in Bethge, Rotermund and Pawelzik (2002)).
Another point we would like to make here is that the
objective function for L2-optimization also plays a role in infomax once
non-zero input noise is considered. Consider the ratio = V(input
noise) / V(neural noise). It can be shown that when this ratio is large,
i.e. when input noise dominates, the infomax solution is just the L2
solution we provided.
(B) Gaussian Signal Assumption
We do
agree with the reviewers that non-Gaussian signals would be more
interesting for the NIPS community.
Our method offers the
possibility of analyzing the non-Gaussian case. When we formulated the
general optimization problem in Section 4 Eq(15), we tried to keep it as
general as possible. The cube term in Eq(15) has the same units as the stimulus
squared, which relates it to the variance of the marginal distribution of
w_k * s. The specific constants (such as 6*sqrt(3)*pi for the Gaussian) may
vary for different families of distributions. The Gaussian distribution is
special because no other distribution would have marginals from the same
family of distributions. However, under certain constraints on the prior
distribution, the constants may be bounded, enabling an analysis similar to the one we
gave in Section 5.1. In addition, even if an analytical solution may not
exist for an arbitrary stimulus prior distribution, numerical solutions for
L2-min and ICA can still be calculated and compared using Eq(15).
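The Gaussian constant quoted above can be checked numerically. The sketch below assumes that the cube term is the cube of the integral of the marginal density raised to the power 1/3, which is what the Holder argument in the Section 3 reply suggests; Eq(15) itself is not reproduced here, so this reading is an assumption. Function and parameter names are illustrative only.

```python
import numpy as np


def cube_term(sigma, half_width=12.0, n_points=200_001):
    """Numerically evaluate (integral of pdf(x)^(1/3) dx)^3 for N(0, sigma^2)."""
    x = np.linspace(-half_width * sigma, half_width * sigma, n_points)
    pdf = np.exp(-x ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)
    dx = x[1] - x[0]
    # Plain Riemann sum; the integrand is negligible at the grid edges.
    integral = np.sum(pdf ** (1.0 / 3.0)) * dx
    return integral ** 3


for sigma in (0.5, 1.0, 2.0):
    numeric = cube_term(sigma)
    closed_form = 6.0 * np.sqrt(3.0) * np.pi * sigma ** 2
    print(f"sigma = {sigma}: numeric = {numeric:.6f}, "
          f"6*sqrt(3)*pi*sigma^2 = {closed_form:.6f}")
```

Under this reading the term scales with the variance sigma^2, consistent with the statement that it has the units of the stimulus squared.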
In our manuscript, we assume the input signal to have a Gaussian
distribution in order to reach conclusions quickly, thanks
to its analytical tractability. The solution for the Gaussian case can
already provide some insight into the different encoding
strategies that a neural system could use. In the Gaussian case the
ICA solution is exactly the same as whitening, as one of the reviewers
pointed out.
(C) Applications / Theoretical Predictions
The results we derived can be applied to signal processing or used to make
predictions about neural populations. For example, we can
explicitly show the L2-optimal basis for 8x8-pixel gray-scale images with
any 64D Gaussian prior. We can also use our results to
predict the locations of the centers of 2D bell-shaped receptive
fields for a neural population. These results were not included in the
submission due to space limits, but they will certainly be presented if our
manuscript is accepted.
(D) Miscellaneous
We would like to
apologize for the extra effort that could have been saved by better
clarity of our manuscript. We will polish the language and improve clarity
to make our paper more reader-friendly.
Section 3.
Holder's inequality implies that (\int |f| dx) (\int |g| dx)
(\int |g| dx) >= (\int |f*g*g|^{1/3} dx)^3, and equality holds when
|f| = c * |g|. Here f = \pi / h'^2 and g = h', and both are positive.
Therefore the optimal value is attained when f = c * g, i.e. \pi = c * h'^3.
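Written out step by step, the argument above reads roughly as follows. This is a sketch under one added assumption that is not stated in this reply: the output range of h, R = \int h' dx, is held fixed. The symbol R is introduced here only for illustration.

```latex
% Generalized Hoelder inequality with three factors, each with exponent 3,
% applied to |f|^{1/3}, |g|^{1/3}, |g|^{1/3}, where f = \pi / h'^2 and g = h'.
\begin{align*}
\Big(\int \pi^{1/3}\,dx\Big)^{3}
  &= \Big(\int |f\,g\,g|^{1/3}\,dx\Big)^{3}
     && \text{since } f g^{2} = \pi \\
  &\le \Big(\int f\,dx\Big)\Big(\int g\,dx\Big)^{2}
     && \text{(exponents } 3, 3, 3\text{)} \\
  &= \Big(\int \frac{\pi}{h'^{2}}\,dx\Big)\Big(\int h'\,dx\Big)^{2}.
\end{align*}
% Assumption: \int h' dx = R is fixed (bounded output range).
\[
\int \frac{\pi(x)}{h'(x)^{2}}\,dx \;\ge\; \frac{1}{R^{2}}
\Big(\int \pi(x)^{1/3}\,dx\Big)^{3},
\qquad \text{with equality iff } f = c\,g,
\text{ i.e. } h'(x) \propto \pi(x)^{1/3}.
\]
```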
Section 5.
In Eq(17) and Eq(18), the magnitude |w_k| is
chosen to be 1 since we made this assumption in Section 2, line 126.
The magnitude can be fixed to any constant because it can be compensated for
by the choice of the non-linearity h.
Section 6.
A more
orthogonal basis would result in less correlated signals only when the
signal itself is uncorrelated. If we consider a correlated signal, then
ICA will perform whitening, which gives a non-orthogonal
basis in general. As our manuscript pointed out, the L2-min criterion will
produce a relatively more orthogonal basis compared to the ICA
solution.
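A small numerical illustration of this point, assuming a hypothetical 2D Gaussian signal with correlation 0.8: orthogonal filters leave the outputs correlated, while a whitening transform decorrelates the outputs using filters that are not orthogonal. Symmetric (ZCA) whitening is used here only as one representative of the whitening family; it is not claimed to be the specific solution discussed in the manuscript.

```python
import numpy as np

# Hypothetical covariance of a correlated 2D Gaussian signal.
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])


def output_cov(filters, signal_cov):
    """Covariance of the filter outputs W s when s has covariance signal_cov."""
    return filters @ signal_cov @ filters.T


def cosine_between_rows(filters):
    """Cosine of the angle between the two filters (rows of the matrix)."""
    a, b = filters
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Orthogonal filters: the standard basis.
orthogonal = np.eye(2)

# Symmetric (ZCA) whitening filters: cov^(-1/2) via the eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(cov)
zca = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

for name, filters in (("orthogonal", orthogonal), ("whitening", zca)):
    print(name)
    print("  cosine between filters:", round(cosine_between_rows(filters), 3))
    print("  output covariance:")
    print(np.round(output_cov(filters, cov), 3))
```

In this toy case the orthogonal filters pass the input correlation straight through to the outputs, whereas the whitening filters remove it at the cost of a clearly non-zero angle cosine between the filters.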