__ Summary and Contributions__: - The paper considers the problem of learning from aggregate observations, where the individual labels are not available, but an aggregate of them is available through a known aggregation function.
- The paper proposes a general setting for dealing with different aggregation functions and provides multiple applications: pairwise similarity/triplet comparison (classification) and rank observation (regression).

__ Strengths__: - While the probabilistic framework is not entirely new (it was explored in a similar fashion in [1], ignoring the GP base model), the precise setup is novel: it covers a wider setting, and the authors propose a wide range of applications ([1] mostly does so for the exponential family).
- In particular, the idea of consistency of learning from aggregate observations is interesting, and the theory provided is useful for the precise formulation of the problem (for maximum likelihood at least).
- The authors' proposal of the three applications is mostly new (though perhaps not regression via the mean, which was also explored in [1]), and they provide an MLE analysis for these applications.
- For these applications, the authors provide a large set of experimental results demonstrating that their method outperforms the baselines.
[1] Variational learning on aggregate outputs with Gaussian processes
HC Law, D Sejdinovic, E Cameron, T Lucas, S Flaxman, K Battle, K Fukumizu
Advances in Neural Information Processing Systems, 2018, 6081-6091

__ Weaknesses__: If the framework applies to classification from label proportions (LLP, learning from label proportions), I would like to see experimental results on that task, so that we can see how the method compares against stronger baselines. The baselines in the current experiments are quite weak; however, I note that there may not be much related work on these applications in this setting.

__ Correctness__: I did not check the proofs in detail, but the authors' proposal makes sense and the empirical methodology is fine.

__ Clarity__: I think the paper is well written in general.

__ Relation to Prior Work__: I am not very familiar with the baselines for the applications the authors propose; however, judging from the related work section, the relation to prior work seems mostly clear.

__ Reproducibility__: Yes

__ Additional Feedback__: Is there a reason the authors use linear regression as one of the models in the experiments? I presume any differentiable model can be used, since we are just using MLE.
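To make the point about model choice concrete, here is a minimal sketch (not the authors' code) of regression from mean observations. It assumes Gaussian label noise with equal variance and labels that are conditionally independent given the features, so that the likelihood of a bag mean is Gaussian around the mean of the per-instance predictions and the MLE reduces to least squares on bag means. The function name `fit_linear_from_bag_means` and the synthetic data are hypothetical.

```python
import numpy as np


def fit_linear_from_bag_means(bags, bag_means):
    """Least-squares fit of a linear model from bag-level mean labels.

    bags:      array of shape (n_bags, bag_size, n_features)
    bag_means: array of shape (n_bags,), the observed mean label per bag

    Assumes y_i = w @ x_i + b + Gaussian noise (i.i.d. across instances),
    so maximizing the likelihood of the bag means is equivalent to
    minimizing the squared error between observed and predicted bag means.
    """
    bags = np.asarray(bags, dtype=float)
    # For a linear model, the mean prediction over a bag equals the
    # prediction at the bag's mean feature vector.
    mean_features = bags.mean(axis=1)
    design = np.hstack([mean_features, np.ones((len(mean_features), 1))])
    coef, *_ = np.linalg.lstsq(design, np.asarray(bag_means, dtype=float), rcond=None)
    return coef[:-1], coef[-1]  # weights, bias


rng = np.random.default_rng(0)
true_w, true_b = np.array([2.0, -1.0]), 0.5
bags = rng.normal(size=(200, 4, 2))   # 200 bags of 4 instances each
labels = bags @ true_w + true_b       # individual labels (never shown to the learner)
bag_means = labels.mean(axis=1)       # the only supervision available
w_hat, b_hat = fit_linear_from_bag_means(bags, bag_means)
```

With a non-linear differentiable model, the averaging-of-features shortcut no longer applies; one would instead average the per-instance predictions inside the loss and optimize it by gradient descent.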
-------
I have updated my score to a 7 after the rebuttal and reading the other reviewers' comments. The overall framework is useful, and there seem to be applications that had not been considered before.

__ Summary and Contributions__: The authors formalize a consistent estimation procedure for learning from aggregate observations. This is an important and recently overlooked problem. The authors propose a classical MLE for this. Evaluations make sense.
---
My original rating still stands.

__ Strengths__: I like this paper. The problem of learning from a pool/bag of observations, where one cannot identify each instance separately (and therefore loses some information), is very important in weakly labeled learning, self-supervised learning, personalization, etc. The presentation of the model and the evaluations are good.

__ Weaknesses__: The presentation can be improved a bit, and the evaluations could be expanded, but this is nevertheless already a good paper. It passes the acceptance threshold of the conference.

__ Correctness__: Yes.

__ Clarity__: Yes. Very well written.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: As stated above, this is an important problem: learning from bags of data points with a cumulative descriptive statistic representing them. This is clearly a longer journal paper compressed into an 8-page format, and to that end the current presentation can be improved. First, adding a few remarks on the main technical result would be useful. Second, how does this setup differ from the weakly labeled setting (or does the proposed estimation provide some guarantees for such weakly labeled, aggregation-based learners)?

__ Summary and Contributions__: Substantially expands the previous "aggregate function learning" settings meant for binary classification to multiclass and regression.

__ Strengths__: Formulation is clean, very well presented, and covers several aggregation scenarios.

__ Weaknesses__: It would be nice to present some real-life problems where the settings they model arise naturally. I can think of one situation, but there Assumption 2 does not hold. Currently they reformulate some multi-class problems to fit their scenarios, so the exercise seems academic, without immediate industry impact.

__ Correctness__: yes and yes

__ Clarity__: very well

__ Relation to Prior Work__: yes

__ Reproducibility__: Yes

__ Additional Feedback__: UPDATE: I have read and taken into account the rebuttal, as well as any ensuing discussions.

__ Summary and Contributions__: This paper presents a general framework for learning classification/regression models from aggregate observations, e.g., the similarity, the mean, and the rank. The model parameters can be estimated via the maximum likelihood principle. The characteristics of the solution are analyzed theoretically.
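As an illustration of likelihood-based learning from one such aggregate, the sketch below computes the log-likelihood of pairwise similarity labels from a classifier's class posteriors. It assumes the two labels in a pair are conditionally independent given their inputs, so that P(same class) = sum_k p(k | x) p(k | x'). The function name and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def pairwise_similarity_log_likelihood(probs_a, probs_b, same):
    """Log-likelihood of 'same class / different class' observations.

    probs_a, probs_b: arrays of shape (n_pairs, n_classes) holding the
                      model's class posteriors for the two pair members
    same:             array of shape (n_pairs,), 1 if the pair shares a label
    """
    probs_a = np.asarray(probs_a, dtype=float)
    probs_b = np.asarray(probs_b, dtype=float)
    same = np.asarray(same, dtype=float)
    # P(y_a == y_b) = sum_k p(k | x_a) * p(k | x_b) under conditional independence.
    p_same = np.sum(probs_a * probs_b, axis=1)
    eps = 1e-12  # guards against log(0)
    return np.sum(same * np.log(p_same + eps)
                  + (1.0 - same) * np.log(1.0 - p_same + eps))


probs_a = np.array([[0.9, 0.1], [0.5, 0.5]])
probs_b = np.array([[0.8, 0.2], [0.5, 0.5]])
same = np.array([1, 0])
ll = pairwise_similarity_log_likelihood(probs_a, probs_b, same)
```

In training, the posteriors would come from a parametric classifier and this quantity would be maximized with respect to its parameters.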

__ Strengths__: The problem that the authors are working on is important and fundamental. The formulation is general, and I think it is the first attempt to consider theoretical aspects for learning from aggregate observations.

__ Weaknesses__: Some of the problems discussed in this work have been addressed previously, and this work generalizes them. It is important and useful work, but the task itself, i.e., learning from aggregate observations, might not be novel.

__ Correctness__: I think this paper is technically sound, and the experimental section is well-organized.

__ Clarity__: The paper is well presented and easy to understand.

__ Relation to Prior Work__: Important related works are well presented. However, there are other studies, known as *ecological inference*, that are close to the motivation of this work. It might be helpful to add a discussion of their relevance.
・Flaxman, S. R.; Wang, Y. X.; and Smola, A. J. 2015. Who supported Obama in 2012?: Ecological inference through distribution regression. In KDD, 289–298.
・D. R. Sheldon and T. G. Dietterich. Collective graphical models. In NIPS, pages 1161–1169, 2011.

__ Reproducibility__: Yes

__ Additional Feedback__: Please respond to the above comments.
Additional questions.
1. The experiments showed a significant improvement in accuracy for the triplet comparison, but I did not understand why. Also, while this improvement is a good result, what are some specific/practical examples where triplet comparisons are given?
2. In the regression task, how did you select the samples that are aggregated? Were they selected randomly?
After author feedback:
I appreciate the authors' feedback.
The task addressed in this work is not novel, but I think it makes good contributions, namely 1) providing a clear and general formulation for learning from aggregate observations, and 2) discussing the theoretical aspects of the MLE in Section 4.
For publication, I would like the authors to discuss the following issues:
1) As the other reviewer mentioned, Assumption 2 seems restrictive in some situations. In the regression task, the authors said in their response that the aggregated samples were picked randomly. However, spatial data, for example, are often aggregated over regions, so that the samples in each region are correlated. The authors mention this limitation in Lines 86-87 of the manuscript, but it would be better to emphasize it from both theoretical and practical perspectives.
2) I think the authors cited related work in the field of machine learning sufficiently, but related work also exists in other fields such as statistics/geo-statistics. One example is the concept of ecological fallacy or ecological inference (mentioned in my review), which addresses the problem of learning an individual-level model from aggregate observations. ML-specific tasks (e.g., classification) have not been addressed in this line of work, but it is related to this paper. So, I recommend that the authors include this concept in the related work section.