NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5057
Title:Sobolev Independence Criterion

Reviewer 1

Significance: The proposed criterion (SIC) relies on maximizing a discrepancy between a joint distribution and the product of its marginals over a set of test functions. Unlike existing criterions, SIC incorporates a gradient constraint on such test functions which enforces sparsity in the coordinates. This provides a way of selecting the most dependent dimensions for which the gradient of the optimal test function will not vanish. Such a criterion provides with additional information compared to the mutual Information or to the HSIC. In that sense, it has potentially many applications. Originality: The idea of using a sparsity-inducing penalty on the gradient of the test functions leads to an elegant formulation that provides interpretable scores of dependence on each variable. Clarity: Very clear presentation of the method and good organization of the paper. I have one minor suggestion: I found, Figure 1 hard to parse at first because it makes it hard to compare the different methods: TPR and FDR are expected to behave differently (TPR: highest is better and opposite for FDR) but they are all in the same figures. Would it be better to have one figure for TPR and another one for FDR? Quality: The paper is technically sound and precise. The experiments seem to support the main claims of the paper which is to learn the dimensions that maximize the dependence between variables. The authors compared the proposed method with other methods: Elastic net, Random forest and simple MSE with an additional gradient penalty. It seems that overall, SIC performs as well as random forests on simple examples but benefits from additional interpretability. It also seems to lead to better False discovery proportions on more complicated datasets compared to GLMs. However, I still have a few questions about the paper: It is not totally clear to me how the gradient penalty could help more than simply using the formulation (P) which learns a sparse selector gate to detect the most relevant features. I think it would be very beneficial to the paper to include an experimental comparison with this simple formulation (for instance in figure 1). Also in the convex case, the authors show the existence of the solution but what remains open is the consistency of the estimator obtained in that case: does it recover the true independence structure? Finally, in section 5, a neural network without bias was used which leads to a class of homogeneous functions. Such class can be very restrictive in that it doesn't have universall approximation capabilities. How critical is this choice of function for the method? Overall I think that the paper proposes an interesting and novel method that has many applications. ================================ After reading the other reviews and the authors' response, I still think this is a good submission and should be accepted. It would be great to incorporate the explanations provided in the response to the final version of paper.

Reviewer 2

Updated review: I thank the authors for their effort and time in addressing the authors concerns. I am happy with the provided clarifications and will keep my score as is. --------------------------- In this paper, the authors tackle the important problem of feature selection. Moreover, they focus on deriving an interpretable feature selection method that can easily be combined to control the rate of false discovery. Both having an interpretable feature selection method as well as efficiently controlling the false discovery rate are important problems in themselves. This paper proposes the Sobolev Independence Criterion, a criterion combining aspects of integral probability measures and gradient-sparsity penalties. The authors do a very good job in motivating the gradient sparsity regularization for SIC. While the first proposal of SIC is non-smooth and biased, auxiliary variables are introduced to mitigate the problem. Furthermore, the auxiliary variables have the convenient interpretation of being normalized importance scores for the features. Under a special class of critics (as considered in section 4), the authors show that SIC is convex and they prove the existence of a unique solution and the convergence of the perturbed SIC to the unperturbed SIC in the limit. To the best of my knowledge, the proofs provided in the appendix are correct. Next, the authors combine SIC with deep relu networks and define the Boosted SIC using different random seeds. As an application of SIC false discovery rate control is proposed, and SIC is combined with holdout randomization tests and knockoffs. While the related work section is rather short, it is succinct and mentions the important previous work in the area. To improve this part of the paper a bit, I would propose to include a bit more detailed comparison of the proposed method with previous work (e.g. what shortcomings of previous methods are mitigated with SIC?). At the end, an experimental evaluation of SIC is provided on both synthetic and real-world datasets. While not outperforming the competing methods on all instances of all datasets, SIC shows competitive performance. While there is a large diversity in the datasets considered, there is only comparison to 1 competing method in the two real-world datasets. I would advise the authors to provide comparison with further competing methods to strengthen the case for their approach. Overall, this paper is very nicely and coherently written with a nice balance of motivation and details in the main text with the proofs present in the appendix. Furthermore, this paper makes as interesting and novel contribution to the field of feature selection with convincing experimental results. Thus, I recommend the acceptance of this paper.

Reviewer 3

Summary: The paper proposed the Sobolev Independence Criterion (SIC), an interpretable dependency measure between multivariate random variables X (input) and Y (response). It is an integral probability metric between the joint distribution and the product of the marginals, regularised by gradient of the witness function wrt each dimensions of X. For a general function space, the optimisation problem has l1- like penalty on the gradient terms to ensure sparsity and l2-like penalty on the square of the witness function to induce smoothness. This form of the optimisation is not easy to work with: 1) expectation appears after the square root in the gradient penalties (i.e. a non-smooth term) and 2) the expectation inside the nonlinearity introduces a gradient estimation bias (biased expectation estimation). The authors proposed to alleviated these problem by introducing an auxiliary variable \eta through the variational form of the square root. This \eta_j parameter is the normalised importance score for each feature j of X and hence can be used for feature selection through their ranking. Two classes of functions have been studied in the paper: fixed feature space and deep ReLu networks with biases removed. When the SIC witness function f is restricted to an RKHS which lead to an optimisation problem that is jointly convex in f and \eta, they provide theoretical guarantees on the existence and uniqueness of the solution [Theorem 1], convergence results of the perturbed SIC (i.e. SIC_\epsilon where \epsilon is an regularisation term added inside the square root of the nonlinear sparsity penalty to induce smoothness) to SIC [Theorem 2] and the decomposition of SIC into the sum of contributions from each coordinates [Corollary 1]. Instead of choose a feature map, they propose to use deep ReLu networks to learn them, however the optimisation problem becomes non convex. They utilises existing optimisation algorithms and FDR control methods to establish SIC based feature selection algorithms. The paper concludes with experiments on two synthetic dataset and two real life datasets. The paper is clearly written however there are a few typos. The structure is logical and explanation is adequate. It utilises the sparsity inducing gradient penalties, that were popular in other areas of machine learning (e.g. double back-propagation, WGAN_GP, optimal transport theory, etc.), as a regularisation term in the traditional integral probability metric. Consequently, one has direct access to the importance score of each feature and this naturally leads to a feature selection scheme. Such formulation is general and one is allowed to consider different function spaces. The work provides an interesting formulation of a dependency measure and of the problem of feature selection, which is worth exploring further. However, overall I find the experiments section requires much more discussion, or at times, more experiments to illustrate the performance of the proposed procedure. I have a few questions/concerns that require some clarifications from the authors: Main concerns: 1) From my understanding, the paper proposed two FDR controlled SIC: HRT-SIC and knockoff SIC. Some more experiments or explanations might be needed since it is not clear to me if the two methods perform similarly? Or are there situations where one clearly is better than the other? Or there are situations where one can not apply one of them? 2) The authors proposed the a kernel SIC and then alleviate the problem of kernel selection by introducing the neural SIC, where the feature maps are learnt. My understanding is that it is highly problem/data dependent when neural networks outperforms kernel methods. Have you tried to use a Gaussian kernel with a fix length-scale (e.g. median heuristic) in the experimental comparison? Is it always better to use the neural SIC? I feel some discussion/experiments are missing here. 3) There is no discussion of the trade-off between the computational time and performance. Can the authors provide some comments on this? I find it hard to judge the gain in the performance without knowing the computational cost of the proposed method. 4) I am a little confused by the beginning of section 4: this seems to be a slightly different formulation from the usual kernel set up. For HSIC (or MMD), the kernel is required to be bounded (i.e. E_z(\sqrt{k(z,z)})< \infty) for the existence of the kernel mean embedding and for the mean embedding to be injective, we further require the kernel to be characteristic. The only constraint on this space of function seems to be through the parameter u, does it mean we can have any feature map \Phi_\omega in the setting presented (e.g. tensor product of feature maps that correspond to non characteristic kernels?) Or does these requirements translates into condition on the parameter u? Or is it in fact not necessary to have any conditions on the feature map? Can the authors please explain these? 5) line 116 on page 4, Theorem 1, (2) why is it important to show that L_\epsilon has compact level sets on the probability simplex? It is also not clear to me how it was show in the proof? Minor concerns: 1) line 49 on page 2, the authors mentioned that it becomes a generalised definition of Mutual information, has this “generalised” concept been studied somewhere already? 2) line 55 equation, what is this zero norm of w? 3) line 90 on page 4, why can we remove the 1/N term from the last term? 4) The HIV-1Drug Resistance with Knockoffs-SIC experiment, it is not clear to me if it is a single run of the experiments or an average over multiple runs? What does GLM stands for? 5) line 386 on page 13, it is not clear to me how we can take partial derivative of L, we can obtain the matrix on the right? Should it not be g? Also the line before line 387 has some typos. ========================= Authors comments read. I am happy with the comments provided by the authors, it would be great if the relevant explanations provided could be elaborated in the revision. I have adjusted my score accordingly.