Paper ID: | 841 |
---|---|
Title: | Hyperspherical Prototype Networks |

Strengths
– The paper presents a novel and well-motivated approach that is crisply explained, with significant experimental results to back it up. Clean and well-documented code has been provided and the results should be easily reproducible.
– The wide variety of applications covered demonstrates its versatility. In particular, the data-independent optimization, ability to bake in priors, computational savings by trimming output dimensionality, improvements over (de facto) softmax classification in the presence of class imbalance, and suitability for multitask learning without loss weighting are strong wins.
– The paper is well written and very easy to follow. The related work section does well to clearly identify and present threads in prior work.

Weaknesses / Questions
– What is the performance of the multitask baseline (Table 5) with appropriate loss weighting?
– L95: “We first observe that the optimal set of prototypes .. one with largest cosine similarity”: How has this observation been made? Is this a conjecture, or is it based on prior work?
– What is the computational efficiency of training hyperspherical networks as compared to prior work? In particular, the estimation of prototypes (L175: gradient descent .. 1000 epochs) seems like a particularly expensive step.
– Prototypical networks were initially proposed as a few-shot learning approach, with the ability to add additional classes at test time with a single forward pass. One limitation of the proposed approach is that it is not suitable for such a setup without retraining from scratch. A similar argument would apply to extending it to open set classification, as the approach divides up the output space into as many regions as classes. While the paper does not claim that this approach extends to either of these settings, have the authors thought about whether / how such an extension would be possible?
– Minor:
  – Have the authors investigated the reason for the sharp spike in performance for both NCM and the proposed approach at 100 epochs, as shown in Figure 4?
  – It is not clear from Figure 1b how the approach is suitable for regression, and a more detailed caption would possibly help.
  – Typo: L85: as many dimensions *as* classes
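
To make the cost question about prototype estimation concrete, here is a minimal sketch of what such a step could look like. It is not the authors' code. It assumes the objective is the mean, over classes, of the largest cosine similarity to any other prototype, minimized by plain gradient descent with re-normalization onto the unit sphere after each step. The function names, the learning rate, and the random initialization are all hypothetical choices made for illustration.

```python
import math
import random


def normalize(v):
    """Project a vector back onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbour(protos, i):
    """Index of the prototype with the largest cosine similarity to prototype i."""
    return max((j for j in range(len(protos)) if j != i),
               key=lambda j: dot(protos[i], protos[j]))


def separation_loss(protos):
    """Mean over classes of the largest cosine similarity to another class."""
    return sum(dot(p, protos[nearest_neighbour(protos, i)])
               for i, p in enumerate(protos)) / len(protos)


def estimate_prototypes(num_classes, dims, epochs=1000, lr=0.1, seed=0):
    """Assumed procedure: (sub)gradient descent on separation_loss."""
    rng = random.Random(seed)
    protos = [normalize([rng.gauss(0, 1) for _ in range(dims)])
              for _ in range(num_classes)]
    for _ in range(epochs):
        grads = [[0.0] * dims for _ in range(num_classes)]
        for i in range(num_classes):
            j = nearest_neighbour(protos, i)
            # d(p_i . p_j)/dp_i = p_j and d(p_i . p_j)/dp_j = p_i
            for d in range(dims):
                grads[i][d] += protos[j][d] / num_classes
                grads[j][d] += protos[i][d] / num_classes
        # Gradient step, then re-project every prototype onto the sphere.
        protos = [normalize([p - lr * g for p, g in zip(protos[i], grads[i])])
                  for i in range(num_classes)]
    return protos
```

Under these assumptions each epoch costs on the order of K² · D operations for K classes in D dimensions and does not touch the training data, so the step is paid once per class/dimension configuration. A timing comparison against the network training itself would settle whether it matters in practice.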

First, and maybe most importantly, how should these class prototypes be assigned in a reasonable way? The paper casts this as an optimization problem that resembles the Tammes problem. This can be problematic, since the Tammes problem in high dimensions is highly non-convex and the optimal solution cannot be obtained or even evaluated. Using gradient-based optimization is okay but far from satisfactory and, most importantly, you cannot evaluate whether you have obtained a good local minimum. Besides this, semantic separation and maximal separation are not necessarily the same. As you mention in your paper, cat and tiger can be more similar than cat and bulldozer. But how to incorporate this prior remains a huge challenge, because it involves the granularity of the classes. Naively assigning class prototypes on the hypersphere could violate the granularity of the classes. The paper considers a very simple way to alleviate this problem by introducing a difference loss between the word2vec order and the prototype order, which makes sense to me. However, it is still far from satisfactory, especially when you have thousands of classes. The class prototype assignment in high dimensions could become a huge problem. From my own experience, I have manually assigned CNN classifiers that have maximal inter-class distance and then trained the CNN features with these classifiers kept fixed the whole time. My results show that it can be really difficult for the network to converge. Maybe adding privileged information, as the authors did, could help, but I am not very sure about it.

Second, I like the idea of using the hypersphere as the output space despite the possible technical difficulties. I have a minor concern about the classification loss. It takes the form of a least-squares minimization, which can essentially be viewed as a regression task. What if the softmax cross-entropy loss were used instead? Would it be better or worse? I am quite curious about the performance.
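
The two losses contrasted in the last question can be written side by side. This is only a sketch under assumptions: the least-squares loss is taken to be the squared cosine distance (1 − cos θ)² to the true class prototype, and the softmax alternative is taken to be cross-entropy over cosine similarities to all fixed prototypes, multiplied by a hypothetical temperature `scale`. Neither function name nor the value of `scale` comes from the paper.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def squared_cosine_loss(z, prototypes, label):
    """Regression-style loss: only the true-class prototype is involved."""
    return (1.0 - cosine(z, prototypes[label])) ** 2


def softmax_cosine_loss(z, prototypes, label, scale=10.0):
    """Cross-entropy over scaled cosine similarities to all fixed prototypes."""
    logits = [scale * cosine(z, p) for p in prototypes]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[label]


# Three maximally separated prototypes on the unit circle (120 degrees apart).
prototypes = [[1.0, 0.0],
              [-0.5, math.sqrt(3) / 2],
              [-0.5, -math.sqrt(3) / 2]]
z = [2.0, 0.0]  # network output aligned with prototype 0; its norm is irrelevant
```

One difference is visible directly from the definitions: the squared cosine loss reaches exactly zero once the output is aligned with its prototype and never looks at the other classes, while the softmax variant stays strictly positive and depends on every prototype. Which of the two trains better is the empirical question raised above.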