__ Summary and Contributions__: This paper proposes a boosting approach that
- implements a Newton step at each iteration,
- randomly chooses, at each iteration, a base learner among different classes,
- is provably convergent
- is specialized to the case when the base classifiers are decision trees of given depths and linear regressors, with a provided C++ implementation, called MixBoost.
== Post-feedback
Feedback taken into account, I stick to my score. Please be sure to take into account the remark on the 256 bins (i.e. state beforehand, at some point, that you work with a 8-bit representation).

__ Strengths__: Strengths
- an algorithmic strategy that allows the selection of base classifiers among different (structured) classes
- a convergence proof of the algorithm, which subsumes existing results (those given in [31])
- numerical simulations

__ Weaknesses__: Weaknesses:
- the paper could gain of having a thorough discussion on the generalization properties of the proposed learning procedure (and if no gain, that should be stated as well);
- is is not clear how (not) stringent is Assumption 3 regarding the hypothesis classes, whereas it is key for the results provided; in addition the finite class claim related to floating point arithmetic could be developed a bit more;
- it is not clear to me what the main strength of MixBoost is:
-- is it related to the computational savings?
-- is it related to the generalization performances of the method?
-- is it related to the use of Random Fourier Features?
- regarding Random Fourier Feature: wouldn't an even accelerated method such as Fastfood (Le et al. 2013) preferable to the sampling proposed here?

__ Correctness__: The claims made are correct, as far as I have checked them.
Empirical methodology: it is correct, with the authors taking time to talk about statistical significance.
There is a weird sentence: "The number of histogram bins h can be at most 256", where it is not clear why 256 should be the maximum number of bins.

__ Clarity__: It is rather clearly written.
Maybe, there is a lack of structure regarding the main ideas that have to be retained (cf. Mixboost and my remarks above).

__ Relation to Prior Work__: There is a section devoted to the related work, section 1.1. (which could be unnumbered since there is no section 1.2).

__ Reproducibility__: Yes

__ Additional Feedback__: Dealing with Assumption 3 seems to be crux: the authors should explain how from a predefined family of base learners, a new class of models can be built that fulfill the requirement of Assumption 3.
Typos:
- line 117: b(x) -> b
- line 223: samples from the Fourier transform of the Gaussian kernel -> random Gaussian vectors ?

__ Summary and Contributions__: In this paper the authors propose a general boosting method which relies on a multitude of learner classes in order to build the final classifier. The main method, coined HNBM, selects a single class at each iteration step and chooses the weak learner from that class. A practical ML tool MixBoost is derived from HNBM by fixing the learner classes (decision trees + a single linear model). Empirical results are provided showing the effectiveness of both HNBM and MixBoost.

__ Strengths__: Combining several learner classes has been a common technique in practical boosting and ensemble methods in general, since it ensures a better diversity among the base classifiers, hence better performance. While the empirical results shown in this paper are not surprising to any ensemble learning practitioner, the strength of this work resides in providing a full theoretical setting for understanding and analyzing heterogeneous base learners. To the best of my knowledge, HNBM is the first framework that provides a clear theoretical insight on heterogeneous learners which englobes several learning paradigms, from heterogeneous data/attributes, to multi-view/multi-source learning. This by itself makes this contribution of significant interest for all the ML community. In particular, HNBM opens up several research questions (different probability mass functions, theoretical aspects of diversity in ensemble learning, etc.).
A more practical contribution is MixBoost, which provides a powerful tool for heterogeneous boosting comparable to state of the art toolkits.

__ Weaknesses__: I have a few comments on this paper, even though it would be unfair to call them weaknesses. They are listed below in no particular order.
- It's regrettable that the probability mass function is practically unexploited. In MixBoost it is set to a quasi-uniform distribution, which depends on only one single parameter. Intuitively, each learner class should be considered individually, even in the case of BDT of different depths. I think that considering various probability mass function would've added further depth to the experimental setting (unless I'm missing an obvious reason why the quasi-uniform distribution is well suited...).
- Continuing from the previous point, it would have been interesting to have a discussion on how the choice of the probability mass function influences the theoretical guarantees of in section 2.4.
- The main strength of HNBM resides in using arbitrary mass functions, yet MixBoost only relies in BDT and LR. I strongly think that combining other types of classifiers should provide further insight on MixBoost.

__ Correctness__: As far as I could tell, the theoretical results and analysis provided in this paper are sound. The empirical results in the main paper could've used a bit more breathing space, but the authors do a nice job at providing ample implementation details in the supplementary material.

__ Clarity__: Very well written paper, easy to follow and the main ideas are clearly outlined. Some parts, such as Sec 2.3 might not be easy to access for the non expert reader.

__ Relation to Prior Work__: The authors provide a nice review of existing works and, to the best of my knowledge, they cite most of the key papers related to the current work.

__ Reproducibility__: Yes

__ Additional Feedback__: after rebuttal: I'm satisfied by the authors answers, and as such I'm keeping my original rating for this paper.

__ Summary and Contributions__: The paper introduces a boosting method in which the hypothesis class may vary along the iterations. More precisely, at each iteration a base hypothesis class is chosen randomly from a fixed set of hypothesis classes. The learning algorithm is a Newton descent in a functional space. Some theoretical guarantees are provided. The method is applied using decision trees of variable size and linear regressors with random Fourier features as hypothesis classes. Promising experimental results are given.

__ Strengths__: The paper is well written and easy to read. The paper is a new contribution on heterogeneous boosting. The proposed method improves performance over previous works based on the idea of a random selection of the base hypothesis class. Theoretical guarantees are provided. I did not check the proofs in detail but the proofs seem sound. The method is implemented with decision trees and linear regressors and prominsing experimental results are provided.

__ Weaknesses__: In my opinion, the paper should contain a more thorough discussion on the choice of the hypothesis classes and on the probability distribution. Theoretically, the bound in Theorem 2 is an expectation over the subclass selection. The authors should discuss this issue and provide elements for choosing the subclasses and the probability distribution. Also, the role of $\Theta$ could be discussed in more details. As a consequence, the choice of the subclasses and of the probability distribution in MixBoost seems completely ad hoc and is not justified. For the experimental part, if the hyperparameter ranges can be found in the Appendix, the optimal values are not given. The paper should provide the optimal values (mainly D_min, D_max and p_t), a study of the robustness against the choice of the hyperparameter values and, if possible, should provide hints to choose the hyperparameter values.

__ Correctness__: To the best of my knowledge, the claims are correct and the empirical methodology is correct. But, as said before, additional comments would be welcome and a robustness experimental study should be given.

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: To the best of my knowledge, the related work is clearly discussed and compared with.

__ Reproducibility__: No

__ Additional Feedback__: Thanks for the feedback. While the authors have done a good job, I remain convinced that the choice of the hypothesis classes and of the PMF could have been more thoroughly discussed both theoretically and experimentally. For instance, it is surprising that D_min=D_max for every dataset. What would be the effect to consider one hypothesis class with small trees (say D_max=5) and large trees (say D_min=6 and D_max =20) and linear regressors. My evaluation is not changed and the paper can be accepted at NeurIPS.
********
The paper could be improved with more thorough discussion on the choice of the hypothesis classes and of the probability distribution both theoretically and experimentally.
The choices done in MixBoost should be justified.
For the experimental study, the optimal hyperparameters should be given.
An experimental study should be given with varying value of $p_t$. For instance, I wonder what are the results for $p_t$=1.