Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
1. The computational-statistical tradeoff shown in this paper is novel, which seems a significant contribution to understanding the single index models. 2. The paper is highly technical, and it is difficult to understand every single piece of the proofs. Therefore for the main paper, I would like to see a more gentle description of model (2.6) and the associated hypothesis testing problem. In the current flow of the paper, it is hard to gain much intuition into why we are interested in such a model and how the lower bounds of the hypothesis testing can imply the lower bound for parameter estimation.
The paper first introduces first-order and second-order Stein's identity and then defines two function sets, C1 and C2, characterized by the covariance between f and X^T\beta^*. Further, authors define a common function set C(psi), which includes all link functions such that the second-order Stein's identity does not vanish under transformation psi. Then, authors propose a mixed model in 2.6 using two link functions f1\in C1\cap C(\psi) and f2\in C2\cap C(\psi). This model is finally used to derive lower bound. This is reasonable since true beta with link function f1 is easy to estimate (using first-order Stein's identity), while true beta with f2 is indistinguishable. The minimax rate is established in Prop 3.1. Analogously, the paper also discusses the computational minimax rate via oracle computational model in the second part. The paper is not in a good quality due to many typos. But my main concern is about the significance of this paper and contribution. 1, Stein's association in 2.1 is more like a proof technique, which is used to define the function sets and further define the mixed model. The usefulness and necessity of a general transformation psi are not clear. One only need very specific psi and also specific f1 and f2, e.g. psi(y) = y^2, to derive lower bound. It would be better if authors discuss the importance of generalization of psi in Def 2.1, otherwise the this directly comes from . 2, Why we need normalization constant \|beta^*\| in C1, instead of directly assuming \|beta^*\| = 1 (the constant can be absorbed in f). 3, It seems the mixed model defined in 148-150 is closely related with . Authors should distinguish this work from . What's the main difference (and difficulty) for using this mixed model to derive lower bound. Are these two works basically solving the same problem? 4, In the main paper, authors emphasize the lower bound only, without upper bound. So writing the contribution paragraph need be careful. It seems the upper bound has different conditions as lower bound (A.2). It would be better to clarify this in main context. 5, Comparing with proof in [16, 53, 62], I don't get the significance of two lower bounds in paper, which I thought are main contributions. Is the proof same as . Although we have index model here but we can always choose specific link function when showing lower bounds. Thus, as my understanding, the problem will reduce to mixed linear regression, studied in those related work. 6, minor things: 23: has--> have 84: second method --> The second method 98: estimate --> estimates 103: Stein's identity comes from Stein's work, not 58, 59. 148: redefine f1
This reads a long paper that is shortened to fit the 8-page limit of NeuIPS. Such manipulation hurts the reading pace a lot since many materials are moved to supplement, and discussions are not thorough. However, I overall still find it an interesting read. My comments are minor in this regard: (1) It should be highlighted throughout that you are studying a simple single index model with a design known to the practitioner (BTW, is "design known" equivalent to saying "design known to be standard multivariate normal"?). (2) For the main results, I noticed that only lower bounds are discussed, and the matching upper bounds are moved to the supplement? I recommend the authors to at least comment in Section 3 that these lower bounds can be matched by the corresponding upper bounds, and point the readers to the exact locations where these results are put.
The introduction is very well written, with a clear presentation of the topic. However, I think that the descriptions of the statistical model and the important definitions are rather unclear (and with many typos). In general, I think that even though this paper should be very relevant to the ML community, its presentation should be revised by the authors: Some of the concepts should be described more precisely for the readers who are unfamiliar with them; The main ideas of the proofs should be explained in the main text; The reasons for the introduction of the first and second order Stein operators are not very clear in the main text; etc. Some more specific comments are given below. I have not read all the proofs, but I have not found any major issue. 1. Abstract, line 8: A star is missing twice on \beta. 2. Line 23, has -> have 3. Line 45: Some words seem to be missing... 4. Line 66: has -> have 5. Line 107: I could not understand what the authors meant when explaining that these quantities are pivotal. Actually, to me, the whole paragraph from line 107 to line 113 is really unclear. Also, the epsilon should be outside of f, in all occurrences. 6. Line 125: I do not understand the introduction of the functions \psi. The paragraph in lines 136-143 is really unclear to me. The choice of \psi does not determine f, so I do not understand the use of the words “reduces” and “characterizes” in lines 136-137. Moreover, in the definition of the minimax risk (et. 2.10), there is a supremum over \psi (on which class?), so \psi does not seem to be fixed. Hence, can the results be used, e.g., for the phase retrieval problem? 7. Line 148, just write f_1(u)=u^2+u and f_2(u)=u^2 (or else, use \beta or \beta^* consistently). 8. The notation for the minimax risk in Eq 2.10 is strange, there is a supremum over f_1, f_2, \psi, of a quantity that does not seem to depend on these functions... 9. Line 180: Remove “is” 10. Line 182: Remove the ‘s’ of “functions” 11. Definition 2.3 and the next paragraph: What exactly is r? For somebody who is not used to this terminology (such as “statistical oracle”), the definition is not very clear. It could be useful to provide some examples, maybe. Moreover, the choice of \tau_q seems arbitrary: Does it mean that one allows the algorithm to only make queries that yield sub-Gaussian errors? Why make this restriction in your model? 12. Line 249, replace T with d^\mu 13. I think that the whole paragraph 3.2 could be replaced with a table, which would be much more readable. And Figure 1 is not informative at all. 14. Could the authors discuss in more details the choice of their mixture model (eq. 2.6)? This is not what one usually thinks of a single index model, since here the function f is random (it randomly takes the values f_1 or f_2). 15. I think that the whole paragraph 3.3 could be removed. Maybe one short comment could be made about the estimation lower bound, but estimation requires many more assumptions (just regarding the identifiability issues), so I do not think that this paragraph adds anything deeply relevant to the paper. Moreover, I think that the authors should make some space in the paper in order to address the issues that I pointed out above.