Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper proposed a quite interesting idea of representing data by weighted graphs (shortest path between nodes). Reviewers have raised concerns on edge dependency and the given similarity metric. However, I’m less worried about making the independence assumption because after all, it’s a model, and it seems to work well in experiments. Likewise, it is also common in variational inference to use independent distribution to approximate a graphical model, based on which learning is carried out. What interests me more is the general methodology of optimization. Suppose we want to optimize f(x). Instead of applying standard optimization, let us endow x_i with a Bernoulli distribution p_i, and then optimize E[f(x)] over all p_i by SGD/Adam/… The idea is surely straightforward, but my question is: has this been done *effectively* in existing machine learning practice? Example obstacles include high variance, computational cost, etc, which the paper has explicitly addressed by tapping into the sparsity of the shortest path solution. So I think this paper is a good addition to the conference. Technically, I wish the final version can clarify the following question regarding the p_\theta(b_i) term in Equation 2. First of all, it should be summation over all edges, rather than all instantiations of b_i because that will give a constant 1. Now if we write out the expectation and utilize the independence, each edge will contribute a term [p_\theta(0)]^2 + [p_\theta(1)]^2. I can't see how this term will encourage p_\theta(0) and p_\theta(1) to be either close to 0 or 1, because it is minimized at uniform distribution (modulo the parameterization \sigma(\theta_b)).