Paper ID: 271
Title: Least Informative Dimensions
Reviews

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Review for # 271: “Least Informative Dimensions”

This paper addresses the problem of finding the low-dimensional stimulus subspaces (of a high-dimensional stimulus space) that most influence the spiking of a neuron or potentially other processes. In contrast to maximally informative dimensions (MID), which directly looks for informative subspaces, this method tries to decompose the stimulus space into two orthogonal subspaces by making one of them as uninformative about the spiking as possible. The methodology is highly advanced, involving replacing the relative entropy by an approximate “integral probability metric” and representing this metric in terms of a kernel function. After making some simplifying assumptions regarding the form of this kernel, the authors use gradient ascent. They give detailed formulas for the required derivatives in the supplement. They then apply their method to several simulated data sets and to a neuron recorded in the electric fish. Using a shuffling procedure to calculate confidence bounds on the integral probability metric, they show that for their examples, this procedure finds a subspace of the correct dimension which contains all the relevant information in the stimulus.

This paper was somewhat dense, but I enjoyed it once I figured out what was going on. The problem of finding informative subspaces is an important one, and new methods are welcome. The math is rigorous and the examples in the results section, while somewhat simple, are appropriate for a NIPS paper given the space constraints.

One thing that is missing is any mention of computation times or a comparison with more standard methods such as Sharpee’s maximally informative dimensions. Given that the authors motivate their work by the computational intensity of directly estimating the mutual information, I think that they should compare the two methods both in their computation time and in their accuracy.

A second question I had regarded the firing rates used in their simulated examples. They state on line 257 that “35% non-zero spike counts were obtained” for the LNP neuron. This seems rather high, corresponding to 350 Hz at 1 ms resolution or 35 Hz at 10 ms resolution. Can the method detect the relevant dimension (particularly in the complex cell example) at lower resolution?

In summary I think this is a nice paper suitable for NIPS if the above concerns are addressed.
Q2: Please summarize your review in 1-2 sentences
A new method for finding stimulus subspaces which are informative about neural spiking (or other random processes). Very mathematically rigorous and quite interesting. However, no direct comparison with other methods (such as maximally informative dimensions) is provided, and given the claim of efficiency (or at least the motivation of efficiency), details such as computation times should be included.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper is devoted to the problem of finding relevant stimulus features from neural responses. The main idea is to minimize the information between a certain set of features and the neural responses combined with the remaining stimulus features, in order to separate the relevant from the irrelevant features. The method builds on prior work by Fukumizu and colleagues, but unfortunately makes an additional restriction that greatly diminishes its potential use. Specifically, the irrelevant and relevant dimensions need to be statistically independent of each other within the stimulus ensemble. While the authors argue that this is not a crucial assumption “because many distributions are Gaussian for which the dependencies between U [informative features] and V [the uninformative features] can be removed by pre-whitening the stimulus data before training”, this assumption essentially eliminates the novel contribution of this method. This is because with correlated Gaussian stimuli one can use spike-triggered covariance (STC) to find all features that are informative about the neural response in one simple computational step that avoids the non-convexity and complexity of the proposed optimization. Furthermore, the STC method has recently been shown to apply to elliptically symmetric stimuli (Samengo & Gollisch, J Computational Neuroscience 2013). Finally, the proposed method has to iteratively increase the number of relevant features, whereas this is not needed in the STC method.

In terms of practical implementations, the method was only demonstrated in this paper on white noise signals. Even in this simple case, the performance exhibited substantial biases. For example, in Figure 1, right panel, a substantial component along the irrelevant subspace remained after optimization. Note that it is not correct to claim (lines 286-888, and also lines 315-316) that “LID correctly identifies the subspace (blue dashed) in which the two true filters (solid black) reside since projections of the filters on the subspace closely resemble the original filters.” The point of this method is to *eliminate* irrelevant dimensions, and this was not achieved.

The superior convergence properties of the proposed technique relative to the spike-triggered average/covariance were not demonstrated. In fact, the method did not even remove biases in the STA, which would be possible using the linear approximation (Figure 3). Note that the Figure 3 description was contradictory: stimuli were described as white noise, yet the footnote stated that the data were not whitened.

The use of information minimization for recovering multiple features and correlated stimuli was demonstrated in Fitzgerald et al. PLoS Computational Biology 2011, “Second-order dimensionality reduction using minimum and maximum mutual information models.”

Line 234-235: it is not correct that MID (also STC) “requires data that contains several spike responses to the same stimulus” – one can optimize the relative entropy based on a sequence of stimuli presented only once in either MID/STC methods.

Lines 236-237: MID/STC can be used with spike patterns by defining events across time bins or different neurons.

The abstract was written in a somewhat misleading manner, because it is not the information that is being minimized, but a related quantity, in combination with crucial assumptions that are not spelled out in the abstract: for example, that the kernels need to factorize. The last statement in the abstract is also unclear: “… if we can make the uninformative features independent of the rest.” Here, it is not clear that this implies that inputs are essentially required to be uncorrelated if any of the input features can be informative for one of the neurons in a dataset.
Q2: Please summarize your review in 1-2 sentences

This manuscript describes a method for finding multiple relevant stimulus features in neuroscience. The paper is written fairly clearly. Unfortunately, the method's implementation leaves much to be desired compared to existing methods (e.g. STC) and does not offer any new capabilities, because it requires uncorrelated inputs, with which STC can be used.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The gist of this paper is essentially a non-parametric method that can be
used to estimate Maximally Informative Dimensions (MID; due to Sharpee et
al.). MID is one of the few spike-triggered characterization techniques
that is capable of extracting multiple dimensions that are informative with
respect to higher-order statistics (unlike classic methods such as STA/STC
which rely on changes in mean and variance, respectively). Extending MID to
multiple dimensions is also difficult due to repeated estimation of
relative entropy, which quickly becomes intractable given the amount of
data required as dimensionality increases. Here, the authors use an
'indirect' estimation approach, whereby they can estimate maximally
informative features without actually having to estimate mutual
information.

This paper is largely excellent. The methodology itself is similar in flavour to
the work of Fukumizu et al, but the similarities and conceptual differences
are discussed in detail by the authors. My main criticism is that the paper
is very methods heavy, with slightly less focus on experiments. This
really does seem like a great estimation technique, and I do not feel that
the experiments section does it justice. Ideally, I would like to have
seen a detailed comparison between MID and LID in real data (perhaps visual
recordings, similar to the data commonly presented in papers by the Sharpee
group). I appreciate that the authors did present some real data
application in the form of P-unit recordings from weakly electric fish,
although one might argue that this data is non-standard in the sense that
the vast majority of spike-triggered characterization techniques are
applied to either the visual or auditory domains. The most disappointing
aspect of the results, however, is that the authors do not seem to address
the issue of taking MID to high dimensions - which is one of the key areas
in which current MID estimation algorithms struggle. Some examples of this
can be found in the recent papers by Atencio, Sharpee, and Schreiner, where
the MID approach is applied to the auditory system. An approach like this
could benefit these kinds of studies greatly, giving them the ability to
investigate higher-dimensional feature spaces.
A final comment is one of computational efficiency - given that such an
approach is likely to be applied to vast amounts of neural data, it would
be nice to get a feel for how long the algorithm takes to run for data sets
of different sizes.

The paper itself is very well written and particularly clear. It does not
seem to suffer from any obvious grammatical or mathematical errors (at
least, to my eyes).

I very much enjoyed this paper. It is a great example of what a good NIPS
paper should be - it uses modern machine learning methods to study problems
in neuroscience. Spike-triggered characterization methods have been used
for decades, and yet the seemingly simple problem of dealing with
high-dimensional feature spaces has been plagued with difficulties. This
paper provides steps towards solving this problem, which will have long
lasting implications in both neuroscience and machine learning communities.
The only negative aspect of this paper, as I mentioned earlier, is that the
results section is somewhat lacking, and does not fully reflect how good
the estimation technique seems to be.
Q2: Please summarize your review in 1-2 sentences
An excellent methodology paper, only slightly lacking in an adequate display of results and relevant comparisons to related methods.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
I think this paper has some significant strengths - I certainly rank it higher than a 2. There are some nice ideas; it's a well-written paper, and I enjoyed reading it. But there are some significant weaknesses here that need to be addressed.

The basic question that is not addressed sufficiently directly here: is the proposed method actually better than MID or the method of Fukumizu et al, either in terms of computational efficiency or statistical accuracy? (Presumably not the latter, since MID is asymptotically efficient.) Direct comparisons need to be provided. In addition, the numerical examples provided here are rather small.  How well do the methods scale? The authors really need to do a better job making the point that the proposed methods offer something that improves on the state of the art.

In addition, a basic weakness of the proposed method should be addressed more clearly. Specifically, on L77, "if Q can be chosen such that I[Y,U : V] = 0" - as noted by reviewer 2, this typically won't happen in the case of naturalistic stimuli X (unlike the case of Gaussian X, where we can easily prewhiten), where these information-theoretic methods are of most interest. So this makes the proposed methods seem significantly less attractive. This is a key point here, and needs to be addressed more effectively.


Minor comments -

L81 - "another dependency measure which ... shares its minimum with mutual information, that is, it is zero if and only if the mutual information is zero." - these two statements don't appear to be equivalent in general. If there is no Q for which the corresponding MI is zero, then it's not clear that these two dependency measures will in fact share the same (arg)min. This point should be clarified here.

Maybe worth writing out a brief derivation for eq (2), at least in the supplement.

It would be nice to have a clearer description of how the incomplete Cholesky decomposition helps - at least mention what the computational complexity is. The kernel ICA paper by Bach and Jordan does a good job of laying out these issues. Note that we still have to recompute the incomplete Cholesky for each new Q (if I understand correctly).
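[For readers unfamiliar with the trick the reviewer refers to, here is a minimal sketch (our illustration, not the paper's code; function names and the Gaussian kernel are assumptions): pivoted incomplete Cholesky builds a rank-m factor G with K ≈ GG^T while evaluating only m columns of the kernel matrix, so the cost is O(nm^2) rather than the O(n^3) of a full factorization, as laid out by Bach and Jordan.]

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel between two points
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def incomplete_cholesky(X, kernel, max_rank, tol=1e-6):
    """Pivoted incomplete Cholesky: returns G (n x m) with K ~= G @ G.T.

    Only one kernel column is evaluated per step, so at most max_rank
    columns of K are ever formed: O(n * m^2) time, O(n * m) memory.
    """
    n = len(X)
    d = np.array([kernel(x, x) for x in X])   # residual diagonal of K - G G^T
    G = np.zeros((n, max_rank))
    for j in range(max_rank):
        i = int(np.argmax(d))                 # greedy pivot: largest residual
        if d[i] < tol:                        # remaining error is negligible
            return G[:, :j]
        k_col = np.array([kernel(x, X[i]) for x in X])  # one kernel column
        G[:, j] = (k_col - G @ G[i, :]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
K = np.array([[gauss_kernel(a, b) for b in X] for a in X])
G = incomplete_cholesky(X, gauss_kernel, max_rank=50)
print(np.max(np.abs(K - G @ G.T)))  # approximation error, close to zero here
```

The reviewer's caveat shows up directly in this sketch: since the kernel is evaluated on the projected data, rotating by a new Q changes K, so G must be recomputed from scratch at every gradient step.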

The relative drawbacks of MID seem overstated.  While Sharpee's implementation of MID bins the projected data, there's no conceptual need to do this - a kernel (unbinned) estimator for the required densities could be used instead (and would probably make the gradients easier to compute), or other estimators for the information could be used that don't require density estimates at all.  Similarly, as one of the other reviewers notes, the MID method could easily be applied to multi-spike patterns.  I also agree with the second reviewer that L235 is incorrect and needs revision.
Q2: Please summarize your review in 1-2 sentences
I think this paper has some significant strengths. There are some nice ideas; it's a well-written paper, and I enjoyed reading it. But there are some significant weaknesses here that need to be addressed.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all reviewers for their constructive feedback. We will make an effort to address all your minor comments in a revised version of the paper. Here, we hope to clarify the major points.

First of all, we would like to stress that our primary motivation was not to improve the runtime of MID but to offer a generalization that extends to populations and spike patterns (we explain below why we think this is difficult with MID). Given the page constraint, we felt that it was more important to demonstrate our method on controlled examples of increasing complexity than to compare it to other algorithms. Furthermore, we are unsure how meaningful a runtime comparison would be, since LID solves a more general problem than MID. We agree that a more extensive experimental exploration is the next step after demonstrating the effectiveness of the method on controlled examples. We will do that in the near future. In the final version of the paper, we will adapt our discussion of MID to avoid giving the impression that our goal was to improve the runtime.

Concerning the generality of our method, reviewer #6(2) claimed that it would not yield any novel contribution beyond STC for Gaussian stimuli, since STC is already sufficient to extract all informative features for correlated Gaussian inputs. This is not the case, for two reasons:

1. STC and STA characterize the spike-triggered ensemble, which essentially captures the informative features for a single bin. In contrast, our method captures stimulus features that are informative about more complex responses, including patterns of spikes or population responses. Therefore, our method solves a more general problem than STC and STA, independent of the stimulus distribution.

2. STC does not necessarily identify all informative features, even for Gaussian stimuli. This is because the spike-triggered distribution need not be Gaussian anymore, as this simple example illustrates: Let x in R^2 with x ~ N(0,I), and let y=1 if a<=|x_1|<=b and 0 otherwise, where a and b are chosen such that var(x_1|y=1)=1 (this can easily be done using the cdf and pdf of the Gaussian). The spike-triggered distribution p(x|y=1) is then still white and has mean zero. Therefore, neither STA nor STC can detect the dependencies between stimulus and response. Our algorithm, on the other hand, detects the correct subspace [1,0]'. A similar example can be constructed for correlated Gaussian stimuli. Therefore, our algorithm is more general than STA and STC, even for Gaussian stimuli.
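[This counterexample is easy to check numerically. The following sketch (our illustration, not the paper's code; a = 0.5 is an arbitrary choice, and b is found by bisection) confirms that the spike-triggered mean and covariance match the prior, so STA and STC are blind to the dependence that x_1 fully carries.]

```python
import math
import numpy as np

def phi(x):   # standard normal pdf
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):   # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cond_var(a, b):
    # var(x | a <= |x| <= b) for x ~ N(0,1); the conditional mean is 0 by symmetry,
    # and integral x^2 phi(x) dx = Phi(x) - x phi(x)
    num = (Phi(b) - b * phi(b)) - (Phi(a) - a * phi(a))
    return num / (Phi(b) - Phi(a))

# Fix a = 0.5 and bisect for the b that makes the spike-triggered variance 1
a, lo, hi = 0.5, 0.6, 10.0
for _ in range(100):
    b = 0.5 * (lo + hi)
    if cond_var(a, b) < 1.0:
        lo = b
    else:
        hi = b

rng = np.random.default_rng(0)
x = rng.standard_normal((200_000, 2))          # white Gaussian stimuli in R^2
y = (a <= np.abs(x[:, 0])) & (np.abs(x[:, 0]) <= b)   # deterministic "spike"

sta = x[y].mean(axis=0)   # spike-triggered average: approximately [0, 0]
stc = np.cov(x[y].T)      # spike-triggered covariance: approximately identity
print(sta, stc)
```

Despite y being a deterministic function of x_1, both statistics coincide with those of the prior, so any STA/STC-based analysis would report no informative dimension at all.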

Reviewer #6(2) also raised the concern that a substantial component of the irrelevant dimensions remains after optimization (Fig. 1, right panel). Here, we would like to stress that LID identifies the informative _subspace_. Thus, the resulting basis vectors can be sign-flipped or linear combinations of the original filters. If the original filters are not orthogonal, the subspace basis will look different from the filters (since Q is in SO(n)). If the subspace is determined correctly, however, it should contain the original filters. This is indeed the case for Fig. 1, right, because substantial parts of the projected filters would be missing otherwise. We will make this point clearer in a final version.

Our reasoning why an extension of MID to patterns and populations is not straightforward is the following: MID maximizes the information between a _single_ spike and the projection of the stimuli onto a particular subspace (which is not equal to I[Y:v'*X], see the related work section in the paper). Assuming that p(v'x | no spike) is very similar to the prior distribution p(v'x), the information of a single spike is approximately the information carried by a single bin. As the single bins can be correlated, it may be desirable to extend MID to several bins. When using spike patterns or population responses, however, there are several pattern-triggered ensembles p(v'x|pattern 1), p(v'x|pattern 2), ... which are unequal to the prior distribution p(v'x) and, therefore, carry information (see the equation in line 232 in the related work section). In that case, I[Y:v'*X] has a term for each of those patterns, and MID needs to estimate the information for each of them in order to account for the full stimulus-response information. Depending on the number of patterns, this can become substantially more involved.

Concerning the independence between U and V, and the choice of Q such that I[Y,U:V]=0, we would like to emphasize that we obtain several advantages by phrasing the objective as I[Y,U:V]= I[Y:X] + I[U:V] − I[Y:U] while maintaining applicability to most common stimulus distributions because I[U:V]=const for most of them. This means that minimizing I[Y,U:V] becomes equivalent to maximizing I[Y:U]. Clearly, for white and correlated Gaussians, I[U:V]=0 after pre-whitening. For elliptically symmetric distributions, pre-whitening yields a spherically symmetric distribution which means that I[U:V]=const, since U and V correspond to coordinates in an orthogonal basis. Since natural signals like images and sound are well described by elliptically symmetric distributions (see work by Lyu/Simoncelli and Hosseini/Sinz/Bethge) I[U:V]=const as well. This also means that even if Q cannot be chosen such that I[Y,U:V]=0, the objective function is still sensible. The only reason why we need I[Y,U:V]=0 is to tie the minimum of the integral probability metric (IPM) to the minimum of the Shannon information. However, even if that minimum cannot be attained, the IPM objective will still yield reasonable features.
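[For completeness, the identity used above follows from the chain rule for mutual information, with X = (U, V); this short derivation is ours, not part of the original rebuttal:]

```latex
\begin{align*}
I[Y,U : V] &= I[U : V] + I[Y : V \mid U]
  && \text{(chain rule on } I[Y,U : V])\\
&= I[U : V] + \bigl(I[Y : U,V] - I[Y : U]\bigr)
  && \text{(chain rule on } I[Y : U,V])\\
&= I[Y : X] + I[U : V] - I[Y : U]
  && \text{(since } X = (U,V)).
\end{align*}
```

Hence, whenever I[U:V] is constant in Q, minimizing I[Y,U:V] over Q is indeed equivalent to maximizing I[Y:U], since I[Y:X] does not depend on Q.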

In return for the additional restriction and the use of the IPM, we get an algorithm that (i) naturally extends to patterns and populations while maintaining computational feasibility, (ii) can assess the null distribution via a permutation analysis (as opposed to Fukumizu et al.), and (iii) does not have to use factorizing joint kernels (as opposed to Fukumizu et al.; we only chose factorizing kernels for convenience).