Paper ID: 1349

Title: Integrating Bayesian and Discriminative Sparse Kernel Machines for Multi-class Active Learning

The algorithm is based on a well-motivated optimization problem that can be tackled with a QP solver. The method seems to clearly outperform several baselines. The main drawback of this paper is the lack of theoretical guarantees for the active learning procedure. Typically, one would want to prove generalization and label complexity bounds.

The idea of combining RVM and SVM is interesting, especially in the context of active learning, since it encourages both selecting points near the decision boundary and exploring the input space well. The paper is well written, and the experiments seem to demonstrate advantages over the baselines.

The usage of the word "generative" in this paper might be inappropriate. I think RVM is a probabilistic method, but not a generative one, since it does not model P(x | y). See Wikipedia for several common definitions of "generative" and "discriminative" models.

In Figure 4, why do the learning curves in some of the plots (e.g., for the Yeast and Reuters datasets) start from different accuracies? Do all of the active learning methods use the KMC learning algorithm? If not, this raises the question of whether the advantage of the proposed method is due to the AL policy or to the learning algorithm. If they all use the same learning algorithm, then all learning curves should start from the same accuracy.

Are the experiments averaged over multiple repeats or just a single run? It is not mentioned that the experiments are repeated, so I am concerned about the statistical significance of the results. How large is S in the experiments on the real datasets (Figure 4)? Table captions should be placed above the tables.

[Update]: From the authors' rebuttal, the proposed KMC algorithm appears to outperform SVM and RVM. Since different active learning policies are coupled with different learning algorithms, and given KMC's reported better performance in the passive learning setting, it remains unclear what contributed to the better active learning performance: the policy or the learning algorithm? The new learning algorithm itself could still be a nice contribution, but as an active learning paper, I think the authors should add more analysis to make this clear.
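To make the policy-vs-learner confound concrete, here is a minimal sketch (not the paper's KMC; all names and parameters are my own illustrative choices) of the kind of controlled comparison being asked for: fix one learner and one labeled seed set, and vary only the query policy. Under that protocol, every learning curve necessarily starts from the same accuracy, so any later gap is attributable to the policy alone.

```python
# Hypothetical sketch: isolate the effect of the AL policy from the learner
# by fixing one learner (an SVM here) and one seed set, varying only the
# query strategy. Both curves then start from the same seed accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = list(range(100))      # identical seed set for every policy
pool = list(range(100, 500))    # unlabeled pool indices

def run(policy, n_rounds=5, batch=10):
    lab, unl = list(labeled), list(pool)
    accs = []
    for _ in range(n_rounds):
        clf = SVC(kernel="rbf").fit(X[lab], y[lab])
        accs.append(clf.score(X, y))
        if policy == "uncertainty":
            # query the smallest-margin (most uncertain) points
            order = np.argsort(np.abs(clf.decision_function(X[unl])))
        else:
            # random-sampling baseline
            order = rng.permutation(len(unl))
        picked = [unl[i] for i in order[:batch]]
        lab += picked
        unl = [i for i in unl if i not in picked]
    return accs

a_unc = run("uncertainty")
a_rnd = run("random")
# Same learner + same seed set => identical first point on both curves.
assert a_unc[0] == a_rnd[0]
```

This is the sense in which identical starting accuracies in Figure 4 would signal a shared learner; divergent starting points suggest the learners themselves differ.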

Originality: The combination of sampling in areas of 'greater interest' while adjusting to the underlying distribution appears in many active learning works, but the objective in (1) is novel, and approaching both in a unified framework is challenging. The lower bounding of the optimization problem is also new.

Quality: The experimental results are very thorough and show the improvement of the proposed method over random sampling as well as several other baselines, and the exploration of the effect of the tuning parameters and the initial sample size is excellent. However, the theoretical contributions appear incomplete. The significant theoretical contribution is the (mislabelled) Theorem 2, and both its statement and its proof are extremely informal; they should be made much more formal. Since the sparsity structure is integral to the computational efficiency of the method, this is a useful and important theorem.

Clarity: The paper draws a clear line from the motivation to the setup. However, the derivation of (1) is quite difficult to follow, even though the expression itself is intuitive. The flow from expression (1) to the final expression in (8) is well structured. It is not entirely clear to the reader that the simulations in Figures 2 and 3 show what the authors claim they show: here the excess exploration by KMC appears unnecessary, and an example where the SVM excessively samples an area with very low density might be more useful. The real-data experiments do a good job of showing the efficacy of the proposed method. The paper has a few typos (see Theorem 2), but overall they did not detract from the readability.

Significance: Optimizing for both informativeness and relevance in active learning is a difficult task, and unified approaches like this one are a good avenue for approaching the problem.
However, to maximize impact, it would be good for the approach to provide some sort of guarantee about the quality of its solution, even just in simple situations.

Edit: If the full proof sketched in the response is added to the paper, I am happy to increase my recommendation to an accept, especially since the authors report that in practice the sparsity rate was around 80% (though could they please confirm this holds across all the datasets, especially the real-world ones).
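For clarity on what a "sparsity rate" claim like the rebuttal's could mean, here is a minimal sketch of one common way to report it: the fraction of training points not retained as basis functions. This is my own illustrative stand-in, not the paper's KMC; since RVM is not available in scikit-learn, an SVM's support vectors play the role of the retained basis here.

```python
# Hypothetical sketch: report a "sparsity rate" as the fraction of training
# points NOT kept as basis functions. An SVM's support vectors stand in for
# the retained basis (RVM is not in scikit-learn); numbers are illustrative.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# support_ holds the indices of the retained (support) training points
sparsity_rate = 1.0 - len(clf.support_) / len(X)
print(f"sparsity rate: {sparsity_rate:.2%}")
```

Reporting this number per dataset, as the edit requests, would make the ~80% claim straightforward to verify.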