Paper ID: 230
Title: Fisher-Optimal Neural Population Codes for High-Dimensional Diffeomorphic Stimulus Representations
Reviews

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
[EDIT] I read the authors' rebuttal, but it did not change my opinion.

The authors address the question of how a population of linear-nonlinear neurons could optimally encode a high-dimensional (Gaussian) stimulus ensemble in a mean-squared-error sense. Their main finding is that the optimal linear weights are not entirely orthogonal and tend to point in directions where the stimulus has low variance; the derivative of the optimal nonlinearity is the cube root of the marginal distribution in the direction of the corresponding linear weight.

The results are novel and the derivations quite elegant. Having analytical results for the 2d case provides some nice intuitions and the explicit learning rule for the high-dimensional case could provide a good starting point for further studies. The figures are well done and illustrative.

Overall, I think the paper constitutes a true contribution to the population coding literature. However, in its current form it is very hard to digest. I had to read it multiple times and many parts only became clear much later.

One of the main problems is that some terms are either defined far from where they are used or never defined at all, so their meaning has to be guessed from context or is mentioned only later. Examples are: \pi(s) (line 159), \Sigma (line 211.5), Var[ws] (line 241). In addition, the large number of language errors in the text (usage of articles, third-person s, tense, etc.) does not facilitate reading. It would be useful to have the manuscript proofread by a native speaker.

In terms of content, my main concern is the use of Fisher information in combination with the statement that the Cramer-Rao bound can in general be attained asymptotically by an optimal decoder (line 387). While this is entirely true in the large-T limit, it is equally irrelevant: the brain has to process information in close to real time and simply does not operate in the large-T limit. See e.g. Berens et al. 2011, which the authors cite, for an example where Fisher information in combination with optimal coding does not make useful predictions with respect to mean-squared error.

The CR bound is attained by the maximum-likelihood decoder in the large-N (neurons) limit for one-dimensional stimuli (e.g. Ecker et al. 2012), but it is unclear whether it is useful in the situation the authors consider, where the number of neurons equals the number of dimensions. Ideally, numerical simulations of maximum-likelihood decoding should be carried out to verify that the CR bound can actually be attained.

Even in the absence of direct verification, an intuitive argument could possibly be made if the representation is sufficiently overcomplete, so that the large-T limit can be replaced by a large-N limit (would the math still work in such a situation?). However, given the available evidence regarding the danger of "blindly" using Fisher information, the authors should be somewhat more careful in stating their conclusions and ideally verify their claims directly.
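
As a concrete sketch of the kind of check I have in mind, the following simulation compares maximum-likelihood decoding of a diffeomorphic LN population (additive Gaussian noise after the nonlinearity, as in the paper) against the Cramer-Rao bound tr(J^{-1}). The weight matrix, nonlinearity, and noise level are arbitrary illustrative choices, not the authors' setup:

    # Monte Carlo check: empirical MSE of the ML decoder vs. the CR bound.
    # Model: r = h(W s) + noise, noise ~ N(0, sigma^2 I); n neurons = n dims.
    import numpy as np
    from scipy.optimize import least_squares

    rng = np.random.default_rng(0)
    n, sigma, trials = 2, 0.05, 2000
    W = np.array([[1.0, 0.3], [0.3, 1.0]])   # hypothetical filter matrix (rows = w_k)
    h = np.tanh                              # hypothetical monotone nonlinearity
    h_prime = lambda u: 1.0 - np.tanh(u)**2

    s_true = np.array([0.4, -0.2])           # decode around a fixed stimulus

    # Fisher information at s_true: J = (1/sigma^2) W^T diag(h'(W s))^2 W,
    # so the CR bound on the total MSE of an unbiased estimator is tr(J^{-1}).
    D = np.diag(h_prime(W @ s_true))
    J = (W.T @ D @ D @ W) / sigma**2
    crb = np.trace(np.linalg.inv(J))

    mse = 0.0
    for _ in range(trials):
        r = h(W @ s_true) + sigma * rng.standard_normal(n)
        # ML decoding under additive Gaussian noise = nonlinear least squares
        fit = least_squares(lambda s: h(W @ s) - r, x0=np.zeros(n))
        mse += np.sum((fit.x - s_true)**2)
    mse /= trials

    print(f"empirical MSE of ML decoder: {mse:.3e}")
    print(f"Cramer-Rao bound tr(J^-1):   {crb:.3e}")

If the empirical MSE stays close to tr(J^{-1}) as the noise shrinks (or as copies of the population are added), the use of the bound is justified in this regime; a large gap would support my concern.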

Finally, it appears somewhat counter-intuitive that the linear weights should point in directions with little variance in the stimulus. As far as I understand, the reason is that the noise is added after the nonlinearity, which means that if the variance in a certain direction is high, an additive error will be large in stimulus space. Since neurons try to have somewhat orthogonal weights, if one neuron points in a direction of large variance, few others will be informative about that direction. Thus, it appears beneficial to have a number of neurons that all respond somewhat to that direction but are not fully aligned. I think the general insight about how the weights should be organized could be discussed in more depth and possibly be illustrated better.
Q2: Please summarize your review in 1-2 sentences
A novel approach to the question of optimal neural tuning for high-dimensional stimuli. The validity of the use of Fisher information and the Cramer-Rao bound should have been verified, and the paper needs substantial improvement in terms of language and clarity.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary

Finding the objective functions that regions of the nervous system are optimized for is a central question in neuroscience, as it provides a central computational principle behind neural representation in a given region. One common objective is to maximize the Shannon information the neural response encodes about the input (infomax). This is supported by some experimental evidence. Another is to minimize the decoding error when the neural population is decoded for a particular variable or variables. This has also been found to have some experimental support. These two objectives are similar in some circumstances, giving similar predictions; in other cases they differ more.

Studies modeling optimal distributions of neural population tuning that minimize decoding error (L2-min) have mostly considered one-dimensional stimuli. In this paper the authors extend this substantially by developing analytical methods for finding the optimal distributions of neural tuning for higher-dimensional stimuli. Their methods apply under certain limited conditions, such as when the number of neurons equals the number of stimulus dimensions (diffeomorphic). The authors compare their results to the infomax solution (in most detail for the 2D case) and find fairly similar results in some respects, but with two key differences: the L2-min basis functions are more orthogonal than the infomax ones, and L2-min has discrete solutions rather than the continuum found for infomax. A consequence of these differences is that L2-min representations encode more correlated signals.


On the quality score, impact score and confidence score.
I chose accept (top 50%) as I thought it was a good and interesting paper.
I chose 2 (although this is a rather coarse-grained measure, I would probably go for 1.7).
I chose 3 (fairly confident), but if I could choose 2.5 I would. I think I got the gist of the paper, and Sections 1-2 were clear. I got lost in several places in Sections 3-6, but I think I followed the general argument to a varying degree. Section 7 was clear.


Quality

Is the paper technically sound?
As far as I can tell, the paper is technically sound. The areas of the paper that I could confidently follow held up in my opinion.

Are the claims well supported?
I would say that the claims are well supported, as far as I could follow them. A little more explanation as to why the more orthogonal solution of the L2-min encodes correlated signals would be useful.

Is this a complete piece of work?
For a NIPS paper I would say yes.

Are the authors careful (and honest) about evaluating both the strengths and weaknesses of the paper?
Yes, the authors are very careful and honest in evaluating the paper's strengths and weaknesses. In fact, I think they are a little too circumspect, and I would have liked just a little more discussion of its strengths and possibilities. However, I recognize there are space limits.


Clarity

Is the paper clearly written?
Where I understood the paper, it seems moderately clearly written. Where I didn't understand it, I don't know whether that was due to the authors' lack of clarity or my lack of familiarity with what they were doing.

The clarity and consistency of the notation could be improved. The notation varies somewhat across the paper, and terms are sometimes not defined. For example, the authors don't define E[·|s] and <·> as expectations. They use n for both the number of neurons and the number of spikes. They use ' for differentiation but also as simply a marker (as in s' in 2.4).

Is the paper well organized?
Yes.

Does the paper adequately inform the reader?
I would have liked some areas spelt out more. See above.


Originality

Are the problems or approaches new?
Yes, analytical methods for high dimensional L2-min representation are new as far as I know.

Is it clear how this work differs from previous contributions?
Yes.

Is related work adequately referenced?
Yes.


Significance

Are these results important?
This work could help form the basis of a more general research program that examines what general objective functions particular neural systems use. Thus I think the results are important.

Are other people likely to use these ideas?
Yes, I think they could be useful as the basis of future, more specific and empirical analyses, and could also be expanded into a more general framework.

Does the paper address a difficult problem in a better way than previous research?
Yes, the analytical results are more flexible and secure and can be used to speed up numerical methods.

Does it advance the state of the art?
Yes

Does it provide a unique theoretical approach?
Yes.


Strengths
High-dimensional L2-min is novel. Novel analytical approach. Comparison with Infomax. Clear differences in solutions given.


Weaknesses
Restricted circumstances (diffeomorphic). A little unclear in places (may be due to reviewer).


Detailed review

Sections 1 and 2 are clear to me. Only minor points:

070: As opposed to the encoding, the

070: function called

080: E[·|s] is the expectation over r, but it might be good to define it.

098: Perhaps you are using a notation I am not familiar with, but by analogy with the L2 norm, should it be the L2^2 loss?

101: In this equation I am more used to a ≥. Please define the meaning of the curved ≥ in this context.

113: Define s'. Later, ' is used for the derivative; please clear up this ambiguity.

In Section 3 I get somewhat confused. Please give more detail on how Hölder's inequality is used to find the minimum.

163: …, the right side of Equation 10 is…

178: n is used for the number of spikes in the appendix; one of these variables needs to be changed.

Section 4. Also found this somewhat confusing.

189: A double integral sign for a multi-dimensional integral is confusing. I suggest placing … between them, or just having one integral sign.

193: as 189.

Sections 5 and 6. Found these somewhat confusing, but less so than 3 and 4.

229: What happens to |w_k| in the relationship between Equations 17 and 18? Confused here.

241: "Minimizing the numerator Var[w_k^T s] makes…" The numerator of which equation? I assume you are referring to Equation 17; if so, the relationship between Var[w_k^T s] and the numerator needs to be made more explicit.

321: You might want to briefly discuss the similarities and differences between Equation 16 and Equation 28.

371: Apologies if I am totally missing the point, but I would imagine that a more orthogonal basis would result in less correlated signals. Where am I wrong? Please explain; a little more elaboration on this point in the text would be useful.

Section 7. Generally clear.

Section A. Clear.

Q2: Please summarize your review in 1-2 sentences
A novel analytical extension of Fisher-information-based optimization of neural coding objective functions to high-dimensional stimuli under certain restricted conditions. It appears solid and a good inspiration for further studies, if perhaps a little unclear in places.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a new analysis of optimal population coding under the assumptions of (a) Gaussian signal statistics; (b) a linear-nonlinear functional model with additive white Gaussian noise; and (c) a so-called diffeomorphic mapping (i.e. the mapping from input to output is invertible and smooth). Their main result is the optimal solution of the linear-nonlinear model that minimizes the L2 reconstruction error. I think this is significant. The authors also review the relevant work, in particular the difference between their solution and the infomax solution.

Overall, I think the quality is above the threshold, and the figures are particularly well constructed.

There are, however, a few major drawbacks:

- The signal is assumed to be Gaussian: it is well known, especially in the NIPS community, that the signal statistics of interest are usually non-Gaussian.

- Lack of application or theoretical prediction of the proposed model. Specifically, what do the linear filters and nonlinearities look like when the model is optimized for a correlated Gaussian signal of 8-by-8 pixels, for example? The authors should already have these results, considering Section 5.3.

Minor comments:

- There are many typos, including: [line 062] consider a $n$ -> consider an $n$; [line 070] function call -> function called; [line 072] the respond -> the response; [line 381] to minimized -> to minimize.

- Comparison to infomax: the authors refer to it as ICA, but because the signal is assumed to be Gaussian, this is misleading and inaccurate. Isn't "whitening" appropriate here?

- equations 1 and 2: I don't think $T$ and $V$ are defined in the manuscript.
Q2: Please summarize your review in 1-2 sentences
I think this is a good paper.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
First of all, we would like to thank all the reviewers for their efforts and insightful comments. In general, we agree with the weaknesses of our paper pointed out by the reviewers. In this rebuttal, we would like to comment on a few aspects that could have been better emphasized in our original submission: (A) the prerequisite condition for using the Cramer-Rao bound; (B) the assumption of a Gaussian signal; (C) the lack of application or theoretical prediction; (D) miscellaneous points.

(A) Condition to Use Cramer-Rao Lower Bound (CRLB)

The CRLB is known to be sharp given a long encoding time (T -> infty) and a proper decoder. We agree with the reviewer who pointed out that such a requirement may undercut the generality and applicability of our result. There are indeed some equivalent conditions, including multiple copies of each neuron (N -> infty).

In the large-N case, copies of the optimal neurons from the diffeomorphic case (# of neurons = # of dimensions) remain optimal under one additional constraint: each neuron must actively encode all possible outcomes of its filtered input as h(w_k * s), i.e. the nonlinear function h is determined only by the marginal distribution \pi_k(w_k * s), regardless of what the other neurons are doing. A counterexample exists in the 1D case (see Fig. 6 in Bethge, Rotermund and Pawelzik (2002)).

We would also like to point out that the objective function for L2 optimization plays a role in infomax as well, once non-zero input noise is taken into account. Consider the ratio V(input noise) / V(neural noise): it can be shown that when this ratio is large, i.e. when input noise dominates, the infomax solution is exactly the L2 solution we provided.

(B) Gaussian Signal Assumption

We do agree with the reviewers that non-Gaussian signals would be more interesting for the NIPS community.

Our method offers the possibility of analyzing the non-Gaussian case. When we formulated the general optimization problem in Section 4, Eq. (15), we tried to keep it as general as possible. The cube term in Eq. (15) has the same units as the squared stimulus, which is related to the variance of the marginal distribution of w_k * s. The specific constants (such as 6*sqrt(3)*pi for the Gaussian) may vary across families of distributions. The Gaussian distribution is special because no other distribution has marginals from the same family of distributions. However, under certain constraints on the prior distribution, the constants may be bounded, enabling an analysis similar to the one we gave in Section 5.1. In addition, even if an analytical solution may not exist for an arbitrary stimulus prior, numerical solutions for L2-min and ICA can still be calculated and compared using Eq. (15).

In our manuscript, we assume the input signal has a Gaussian distribution because its analytical tractability lets us reach conclusions quickly. The solution for the Gaussian case can already provide some insight into the different encoding strategies that a neural system could use. In the Gaussian case the ICA solution is exactly the same as whitening, as one of the reviewers pointed out.

(C) Applications / Theoretical Predictions

The results we derived can be applied to signal processing or used to make predictions about neural populations. For example, we can explicitly show the L2-optimal basis for 8x8-pixel gray-scale images with any 64D Gaussian prior. As another example, we can predict the locations of the centers of 2D bell-shaped receptive fields for a neural population. Those results were not included in the submission due to the space limit, but they will certainly be presented if our manuscript is accepted.

(D) Miscellaneous

We apologize for the extra effort that better clarity in our manuscript could have saved. We will polish the language and improve the clarity to make our paper more reader-friendly.

Section 3.

Hölder's inequality implies that (\int |f| dx)(\int |g| dx)(\int |g| dx) >= (\int |f*g*g|^{1/3} dx)^3, with equality when |f| = c * |g|. Here f = \pi / h'^2 and g = h', and both are positive. Therefore the optimal value is attained when f = c * g, i.e. \pi = c * h'^3.
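
Written out in display form, using the f and g just defined (note that f * g^2 = \pi, so the lower bound is a constant independent of h):

\[
\left(\int \frac{\pi}{h'^2}\,dx\right)\left(\int h'\,dx\right)^2
\;\ge\;
\left(\int \Big(\frac{\pi}{h'^2}\cdot h'\cdot h'\Big)^{1/3} dx\right)^3
= \left(\int \pi^{1/3}\,dx\right)^3,
\]

with equality iff \pi / h'^2 = c\,h', i.e. h' \propto \pi^{1/3}.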

Section 5.

In Eq. (17) and Eq. (18), the magnitude |w_k| is chosen to be 1, since we made this assumption in Section 2, line 126. The magnitude can be fixed to any constant because it can be compensated for by the choice of the nonlinearity h.

Section 6.

A more orthogonal basis results in less correlated signals only when the signal itself is uncorrelated. If we consider a correlated signal, then ICA performs whitening, which in general gives a non-orthogonal basis. As our manuscript points out, the L2-min criterion produces a relatively more orthogonal basis compared to the ICA solution.
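
As a minimal numerical illustration of this point (the covariance matrix below is an arbitrary example, not taken from our manuscript):

    # For a correlated Gaussian signal, the whitening (infomax/ICA) filters
    # are not orthogonal, even though the outputs are decorrelated.
    import numpy as np

    Sigma = np.array([[1.0, 0.8],
                      [0.8, 1.0]])            # correlated 2D Gaussian signal

    # Symmetric (ZCA) whitening matrix: W = Sigma^{-1/2}
    evals, evecs = np.linalg.eigh(Sigma)
    W = evecs @ np.diag(evals**-0.5) @ evecs.T

    w1, w2 = W[0], W[1]                       # rows of W are the linear filters
    cos_angle = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
    print("cosine between whitening filters:", round(cos_angle, 3))  # nonzero
    print("output covariance:\n", W @ Sigma @ W.T)                   # identity

The stronger the signal correlation, the further the whitening filters deviate from orthogonality, which is where the difference from the relatively more orthogonal L2-min basis becomes visible.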