Paper ID: | 5985 |
---|---|

Title: | On Tractable Computation of Expected Predictions |

Good quality research, but I think the potential significance and impact of the results is low. The work has limited application, beyond simple toy models and does not provide new significant theoretical insights.

Originality: This work is an application of the much explored idea (partition functions, belief propagation, dynamic programing, etc.) that sums/max operations on functions whose variable dependencies decompose into a tree-like structure and can be carried efficiently via recursive procedures and caching. The paper makes a small contribution over the work of Guy Van den Broeck. The quality of the work is solid. The proofs, in particular, are very clean, and mostly amount to rearranging multiple summations of products. Definitions and statements are precise. Great job! The clarity of the writing and explanations is very good in general, but the paper is, at times, too concise in its explanation, making it easy to read only for people previously exposed to these topics. From the theoretical point of view, the significance of work is modest, since the results are not surprising, and are of similar flavor as ideas regarding how to efficiently compute sums/products of functions. The push to bring circuit theory, from the theory of computation field, to ML is laudable, and important. However, from a practical point of view, most technology seems to be driven by continuous and inexact methods (training NNs, etc.), not by discrete and exact methods. From a practical point of view, the significance is unclear. The data-sets used are very small and simple. Also, it would be good if the authors compared their technique with methods other than imputation methods, and small data-sets. Maybe compare against some sampling methods to estimate the moments.

The paper considers an important problem that has also been mentioned in the literature. The algorithms proposed are non-trivial and the paper is overall very well-written and gives a clear exposition of the concepts and the ideas used throughout with many intuitive discussions which were useful. However, some parts could be improved. Specifically, I suggest adding a related work section wherein more details about the previous work and their relation to the current work are discussed. This would also help non-experts learn more about the true contribution of this paper and the concepts introduced. Nonetheless, I believe the considered problem could of interest to many people at NeurIPS and would like to see the paper accepted if there is room. Minor comment === Line 29: an arbitrary discriminative models --> model?

Logical circuits, are DAG structures used to represent functions. Specific structural properties of these DAG structures and the corresponding logic circuit representations are used to demonstrate ways of efficient computation of higher order expectations. If the logic circuit representations of the probabilities and discriminative function satisdy specific properties, these moments are efficiently computed. The proofs of propositions provided in Section 4 (1 & 2) seem to reasonably check out. However that of Theorem 2 & 3 need more elaborate description that seems missing. Overall the paper seems to be extending results of [10]. The results are an interesting application of very old methods (logical circuit representations) to the problem of calculating expectations. The paper is clearly written and organized well barring minor typos. I am not really convinced of the significance of this work given the negative results provided in Theorems 2&3, which suggest that seemingly for simple classes of discriminative functions calculating moments using this framework is #P-hard.

The manuscript considers basic statistical questions regarding reasoning about the expected outcome of a predictive model. Efficiently computing even the expectation (first moment) is a known challenge even for simple predictive models and simple generative models (e.g. logistic regression and naive Bayes distribution). The authors give a pair of generative and discriminative models (family of structured probabilistic circuits) that enables tractable computation of expectations (and higher order moments as well), in some cases approximately, b) provide algorithms for computing moments of predictions wrt generative models and c) show that the utility of the algorithms in handling missing data during prediction time compared to standard imputation techniques on some datasets. The paper is organized and written well, there are some good technical contributions. But I'm unable to get a good grasp on the overall significance and merit of this work - partly because the authors aren't convincing enough throughout the paper. I'm also not entirely sure if NeurIPS readers are the right audience for this work - not just in terms of applying these results in practice, but primarily in terms of taking the scope of this work forward. In the problem set up of Section 2, I wasn't entirely sure where the paper was heading towards - for example, why should one be interested in *exact* computation of mean/moments at all, given that the predictive models are already constrained by the availability of training data? Furtheremore, why is this a new problem in ML/computer science? I'd think these are fundamental questions in statistics and inter-disciplinary computational science communities (I don't see references to these in Sec 2). I also didn't see an exposition of jump from logical circuits to real-valued data/inferences anywhere in the paper (nor at least a hint in the main Section). The paper does cite references to existing work that deal with these issues, but the gist of idea needs to be conveyed in Section 3 to make the ideas grounded and concrete. In fact, the Section is written assuming PSDD is a "given" which is already concerning (of course, the reader might guess that it will be learned from data as well). Not until later in the experiments does the paper give a concrete footing - "Our advantage clearly comes from the PSDD learning a better density estimation of the data distribution, instead of having fixed prior assumptions about the features." I liked the experimental design and results, but I don't know why the authors don't talk about nor compare to simple density estimation or other statistical tools to compute expectations of (non-linear) predictive functions from training data (even without caring about characterizing the data distribution, i.e.). --- I've read the authors' response; my stance on the paper is however unswayed. Overall, I like this work for some of the technical contributions, but what I'm still not clear about is why accurately determining the mean/moments is significant from machine learning perspective, as well as lack of comparisons/references to work from Stats community.