{"title": "Hyperspherical Prototype Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1487, "page_last": 1497, "abstract": "This paper introduces hyperspherical prototype networks, which unify classification and regression with prototypes on hyperspherical output spaces. For classification, a common approach is to define prototypes as the mean output vector over training examples per class. Here, we propose to use hyperspheres as output spaces, with class prototypes defined a priori with large margin separation. We position prototypes through data-independent optimization, with an extension to incorporate priors from class semantics. By doing so, we do not require any prototype updating, we can handle any training size, and the output dimensionality is no longer constrained to the number of classes. Furthermore, we generalize to regression, by optimizing outputs as an interpolation between two prototypes on the hypersphere. Since both tasks are now defined by the same loss function, they can be jointly trained for multi-task problems. Experimentally, we show the benefit of hyperspherical prototype networks for classification, regression, and their combination over other prototype methods, softmax cross-entropy, and mean squared error approaches.", "full_text": "Hyperspherical Prototype Networks\n\nPascal Mettes\n\nISIS Lab\n\nUniversity of Amsterdam\n\nElise van der Pol\n\nUvA-Bosch Delta Lab\nUniversity of Amsterdam\n\nAbstract\n\nCees G. M. Snoek\n\nISIS Lab\n\nUniversity of Amsterdam\n\nThis paper introduces hyperspherical prototype networks, which unify classi\ufb01cation\nand regression with prototypes on hyperspherical output spaces. For classi\ufb01ca-\ntion, a common approach is to de\ufb01ne prototypes as the mean output vector over\ntraining examples per class. Here, we propose to use hyperspheres as output\nspaces, with class prototypes de\ufb01ned a priori with large margin separation. 
We position prototypes through data-independent optimization, with an extension to incorporate priors from class semantics. By doing so, we do not require any prototype updating, we can handle any training size, and the output dimensionality is no longer constrained to the number of classes. Furthermore, we generalize to regression, by optimizing outputs as an interpolation between two prototypes on the hypersphere. Since both tasks are now defined by the same loss function, they can be jointly trained for multi-task problems. Experimentally, we show the benefit of hyperspherical prototype networks for classification, regression, and their combination over other prototype methods, softmax cross-entropy, and mean squared error approaches.

1 Introduction

This paper introduces a class of deep networks that employ hyperspheres as output spaces with an a priori defined organization. Standard classification (with softmax cross-entropy) and regression (with squared loss) are effective, but are trained in a fully parametric manner, ignoring known inductive biases, such as large margin separation, simplicity, and knowledge about source data [28]. Moreover, they require output spaces with a fixed output size, either equal to the number of classes (classification) or a single dimension (regression). We propose networks with output spaces that incorporate inductive biases prior to learning and have the flexibility to handle any output dimensionality, using a loss function that is identical for classification and regression.

Our approach is similar in spirit to recent prototype-based networks for classification, which employ a metric output space and divide the space into Voronoi cells around a prototype per class, defined as the mean location of the training examples [11, 12, 17, 37, 45].
While intuitive, this definition alters the true prototype location with each mini-batch update, meaning it requires constant re-estimation. As such, current solutions either employ coarse prototype approximations [11, 12] or are limited to few-shot settings [4, 37]. In this paper, we propose an alternative prototype definition.

For classification, our notion is simple: when relying on hyperspherical output spaces, prototypes do not need to be inferred from data. We incorporate large margin separation and simplicity from the start by placing prototypes as uniformly as possible on the hypersphere, see Fig. 1a. However, obtaining a uniform distribution for an arbitrary number of prototypes and output dimensions is an open mathematical problem [31, 39]. As an approximation, we rely on a differentiable loss function and optimization to distribute prototypes as uniformly as possible. We furthermore extend the optimization to incorporate privileged information about classes to obtain output spaces with semantic class structures. Training and inference are achieved through cosine similarities between examples and their fixed class prototypes.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

(a) Classification. (b) Regression.

Figure 1: This paper demonstrates that for (a) classification and (b) regression, output spaces do not need to be learned. We define them a priori by employing hyperspherical output spaces. For classification the prototypes discretize the output space uniformly and for regression the prototypes enable a smooth transition between regression bounds.
This results in effective deep networks with flexible output spaces, integrated inductive biases, and the ability to optimize both tasks in the same output space without the need for further tuning.

Prototypes that are a priori positioned on hyperspherical outputs also allow for regression by maintaining two prototypes as the regression bounds. The idea is to perform optimization through a relative cosine similarity between the output predictions and the bounds, see Fig. 1b. This extends standard regression to higher-dimensional outputs, which provides additional degrees of freedom not possible with standard regression, while obtaining better results. Furthermore, since we optimize both tasks with a squared cosine similarity loss, classification and regression can be performed jointly in the same output space, without the need to weight the different tasks through hyperparameters. Experimentally, we show the benefit of hyperspherical prototype networks for classification, regression, and multi-task problems.

2 Hyperspherical prototypes

2.1 Classification

For classification, we are given N training examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^L$ and $y_i \in C$ denote the inputs and class labels of the i-th training example, $C = \{1, \dots, K\}$ denotes the set of K class labels, and L denotes the input dimensionality. Prior to learning, the D-dimensional output space is subdivided approximately uniformly by prototypes $P = \{p_1, \dots, p_K\}$, where each prototype $p_k \in S^{D-1}$ denotes a point on the hypersphere. We first provide the optimization for hyperspherical prototype networks given a priori provided prototypes. Then we outline how to find the hyperspherical prototypes in a data-independent manner.

Loss function and optimization. For a single training example $(x_i, y_i)$, let $z_i = f_\phi(x_i)$ denote the D-dimensional output vector given a network $f_\phi(\cdot)$.
Because we fix the organization of the output space, as opposed to learning it, we propose to train a classification network by minimizing the angle between the output vector and the prototype $p_{y_i}$ for ground truth label $y_i$, so that the classification loss $L_c$ to minimize is given as:

$$L_c = \sum_{i=1}^{N} \big(1 - \cos\theta_{z_i, p_{y_i}}\big)^2 = \sum_{i=1}^{N} \Big(1 - \frac{z_i \cdot p_{y_i}}{\|z_i\|\,\|p_{y_i}\|}\Big)^2. \quad (1)$$

The loss function maximizes the cosine similarity between the output vectors of the training examples and their corresponding class prototypes. Figure 2 provides an illustration in output space $S^2$ for a training example (orange), which moves towards the hyperspherical prototype of its respective class (blue) given the cosine similarity. The higher the cosine similarity, the smaller the squared loss in the above formulation. We note that unlike common classification losses in deep networks, our loss function is only concerned with the mapping from training examples to a pre-defined layout of the output space; neither the space itself nor the prototypes within the output space need to be learned or updated.

Figure 2: Visualization of hyperspherical prototype network training for classification. Output predictions move towards their prototypes based on angular similarity.

Since the class prototypes do not require updating, our network optimization only requires a backpropagation step with respect to the training examples. Let $\theta_i$ be shorthand for $\theta_{z_i, p_{y_i}}$. Then the partial derivative of the loss function of Eq.
1 with respect to $z_i$ is given as:

$$\frac{\partial}{\partial z_i}\big(1 - \cos\theta_i\big)^2 = 2\big(1 - \cos\theta_i\big)\Big(\frac{\cos\theta_i \cdot z_i}{\|z_i\|^2} - \frac{p_{y_i}}{\|z_i\|\,\|p_{y_i}\|}\Big). \quad (2)$$

The remaining layers in the network are backpropagated in the conventional manner given the error backpropagation of the training examples of Eq. 2. Our optimization aims for angular similarity between outputs and class prototypes. Hence, for a new data point $\tilde{x}$, prediction is performed by computing the cosine similarity to all class prototypes and we select the class with the highest similarity:

$$c^* = \arg\max_{c \in C} \big(\cos\theta_{f_\phi(\tilde{x}), p_c}\big). \quad (3)$$

Positioning hyperspherical prototypes. The optimization hinges on the presence of class prototypes that divide the output space prior to learning. Rather than relying on one-hot vectors, which only use the positive portion of the output space and require at least as many dimensions as classes, we incorporate the inductive biases of large margin separation and simplicity. We do so by assigning each class to a single hyperspherical prototype and we distribute the prototypes as uniformly as possible. For D output dimensions and K classes, this amounts to a spherical code problem of optimally separating K classes on the D-dimensional unit hypersphere $S^{D-1}$ [35]. For D = 2, this can be easily solved by splitting the unit circle $S^1$ into equal slices, separated by an angle of $2\pi/K$. Then, for each angle $\psi$, the 2D coordinates are obtained as $(\cos\psi, \sin\psi)$.

For $D \geq 3$, no such optimal separation algorithm exists. This is known as the Tammes problem [39], for which exact solutions only exist for optimally distributing a handful of points on $S^2$ and none for $S^3$ and up [31].
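The D = 2 construction above, together with the loss of Eq. 1 and the prediction rule of Eq. 3, can be sketched in a few lines. This is a minimal plain-Python sketch; the helper names are ours, not from the paper's released code.

```python
# Minimal sketch (our own helper names, no deep-learning framework) of:
# equal-angle prototypes on the unit circle for D = 2, the classification
# loss of Eq. 1, and the prediction rule of Eq. 3.
import math

def circle_prototypes(K):
    """K prototypes on S^1, separated by an angle of 2*pi/K."""
    return [[math.cos(2 * math.pi * k / K), math.sin(2 * math.pi * k / K)]
            for k in range(K)]

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classification_loss(outputs, labels, prototypes):
    """Eq. 1: sum over the batch of (1 - cos(z_i, p_{y_i}))^2."""
    return sum((1.0 - cosine(z, prototypes[y])) ** 2
               for z, y in zip(outputs, labels))

def predict(output, prototypes):
    """Eq. 3: the class whose fixed prototype is most cosine-similar."""
    return max(range(len(prototypes)),
               key=lambda c: cosine(output, prototypes[c]))

protos = circle_prototypes(4)            # prototypes at 0, 90, 180, 270 degrees
assert predict([0.9, 0.1], protos) == 0  # closest in angle to the 0-degree prototype
assert classification_loss([protos[2]], [2], protos) < 1e-12  # exact hit: zero loss
```

Note that only the cosine similarity to the fixed prototypes is needed, so the prototypes can be precomputed once and reused for both training and inference.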
To obtain hyperspherical prototypes for any output dimension and number of classes, we first observe that the optimal set of prototypes, $P^*$, is the one where the largest cosine similarity between two class prototypes $p'_k, p'_l$ from the set is minimized:

$$P^* = \arg\min_{P' \in P} \Big( \max_{(k,l) \in C,\, k \neq l} \cos\theta_{p'_k, p'_l} \Big). \quad (4)$$

To position hyperspherical prototypes prior to network training, we rely on a gradient descent optimization for the loss function of Eq. 4. In practice, we find that computing all pair-wise cosine similarities and only updating the most similar pair is inefficient. Hence we propose the following optimization:

$$L_{HP} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \in C} M_{ij}, \quad M = \hat{P}\hat{P}^T - 2I, \quad \text{s.t. } \forall i\ \|\hat{P}_i\| = 1, \quad (5)$$

where $\hat{P} \in \mathbb{R}^{K \times D}$ denotes the current set of hyperspherical prototypes, I denotes the identity matrix, and M denotes the pairwise prototype similarities. The subtraction of twice the identity matrix prevents self-selection in the max-pooling. The loss function minimizes the nearest cosine similarity for each prototype and can be optimized quickly since it is in matrix form. We optimize the loss function by iteratively computing the loss, updating the prototypes, and re-projecting them onto the hypersphere through $\ell_2$ normalization. Compared to uniform sampling methods [14, 30], we explicitly enforce separation. This is because uniform sampling might randomly place prototypes near each other – even though each position on the hypersphere has an equal chance of being selected – which negatively affects the classification.

Prototypes with privileged information. So far, no prior knowledge of classes is assumed. Hence all prototypes need to be separated from each other.
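Before adding priors, Eq. 5 can be made concrete. The following is a minimal sketch of evaluating the separation loss in plain Python, assuming the prototypes are already ℓ2-normalized; the helper names are ours.

```python
# Sketch of the separation loss of Eq. 5: for each prototype, take the
# largest cosine similarity to any *other* prototype and average these.
# Prototypes are assumed l2-normalized, so a dot product equals cos(theta).
import math

def l2_normalize(p):
    """Project a vector back onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in p))
    return [x / n for x in p]

def separation_loss(prototypes):
    """L_HP = (1/K) * sum_i max_{j != i} cos(p_i, p_j).
    The -2I term in M = P P^T - 2I only pushes the diagonal
    (self-similarity 1) below any off-diagonal entry, so here we
    equivalently skip j == i."""
    K = len(prototypes)
    total = 0.0
    for i in range(K):
        total += max(sum(a * b for a, b in zip(prototypes[i], prototypes[j]))
                     for j in range(K) if j != i)
    return total / K

# Three maximally separated prototypes on S^1, 120 degrees apart:
# every nearest-neighbour similarity is cos(120 deg) = -0.5.
protos = [l2_normalize([math.cos(2 * math.pi * k / 3),
                        math.sin(2 * math.pi * k / 3)]) for k in range(3)]
assert abs(separation_loss(protos) - (-0.5)) < 1e-9
```

In the full optimization this loss would be minimized by gradient descent on the prototype coordinates, re-applying `l2_normalize` after each update, as described above.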
While separation is vital, semantically unrelated classes should be pushed further away than semantically related classes. To incorporate such privileged information [41] from prior knowledge in the prototype construction, we start from word embedding representations of the class names. We note that the names of the classes generally come for free. To encourage finding hyperspherical prototypes that incorporate semantic information, we introduce a loss function that measures the alignment between prototype relations and word embedding relations of the classes. We found that a direct alignment impedes separation, since word embedding representations do not fully incorporate a notion of separation. Therefore, we use a ranking-based loss function, which incorporates similarity order instead of direct similarities.

More formally, let $W = \{w_1, \dots, w_K\}$ denote the word embeddings of the K classes. Using these embeddings, we define a loss function over all class triplets inspired by RankNet [5]:

$$L_{PI} = \frac{1}{|T|} \sum_{(i,j,k) \in T} -\bar{S}_{ijk} \log S_{ijk} - (1 - \bar{S}_{ijk}) \log(1 - S_{ijk}), \quad (6)$$

where T denotes the set of all class triplets. The ground truth $\bar{S}_{ijk} = [[\cos\theta_{w_i,w_j} \geq \cos\theta_{w_i,w_k}]]$ states the ranking order of a triplet, with $[[\cdot]]$ the indicator function. The output $S_{ijk} \equiv \frac{e^{o_{ijk}}}{1 + e^{o_{ijk}}}$ denotes the ranking order likelihood, with $o_{ijk} = \cos\theta_{p_i,p_j} - \cos\theta_{p_i,p_k}$. Intuitively, this loss function optimizes for hyperspherical prototypes to have the same ranking order as the semantic priors. We combine the ranking loss function with the loss function of Eq. 5 by summing the respective losses.

2.2 Regression

While existing prototype-based works, e.g. [11, 17, 37], focus exclusively on classification, hyperspherical prototype networks handle regression as well.
In a regression setup, we are given N training examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \mathbb{R}$ now denotes a real-valued regression value. The upper and lower bounds on the regression task are denoted as $v_u$ and $v_l$ respectively and are typically the maximum and minimum regression values of the training examples. To perform regression with hyperspherical prototypes, training examples should no longer point towards a specific prototype as done in classification. Rather, we maintain two prototypes: $p_u \in S^{D-1}$, which denotes the regression upper bound, and $p_l \in S^{D-1}$, which denotes the lower bound. Their specific direction is irrelevant, as long as the two prototypes are diametrically opposed, i.e. $\cos\theta_{p_l,p_u} = -1$. The idea behind hyperspherical prototype regression is to perform an interpolation between the lower and upper prototypes. We propose the following hyperspherical regression loss function:

$$L_r = \sum_{i=1}^{N} \big(r_i - \cos\theta_{z_i,p_u}\big)^2, \quad r_i = 2 \cdot \frac{y_i - v_l}{v_u - v_l} - 1. \quad (7)$$

Eq. 7 uses a squared loss function between two values. The first value denotes the ground truth regression value, normalized based on the upper and lower bounds. The second value denotes the cosine similarity between the output vector of the training example and the upper bound prototype. To illustrate the intuition behind the loss function, consider Fig. 3, which shows an artificial training example in output space $S^2$, with a ground truth regression value $r_i$ of zero. Due to the symmetric nature of the cosine similarity with respect to the upper bound prototype, any output of the training example on the turquoise circle is equally correct. As such, the loss function of Eq.
7 adjusts the angle of the output prediction either away from or towards the upper bound prototype, based on the difference between the expected and measured cosine similarity to the upper bound.

Our approach to regression differs from standard regression, which backpropagates losses on one-dimensional outputs. In the context of our work, this corresponds to an optimization on the line from $p_l$ to $p_u$. Our approach generalizes regression to higher dimensional output spaces. While we still interpolate between two points, the ability to project to higher dimensional outputs provides additional degrees of freedom to help the regression optimization. As we will show in the experiments, this generalization results in a better and more robust performance than mean squared error.

Figure 3: Visualization of hyperspherical prototype network training for regression. The output prediction moves angularly towards the turquoise circle, which corresponds to the example's ground truth regression value.

Table 1: Accuracy (%) of hyperspherical prototypes compared to baseline prototypes on CIFAR-100 and ImageNet-200 using ResNet-32. Hyperspherical prototypes handle any output dimensionality, unlike one-hot encodings, and obtain the best scores across dimensions and datasets.

                         CIFAR-100                                      ImageNet-200
Dimensions   10          25          50          100          25          50          100         200
One-hot      -           -           -           62.1 ± 0.1   -           -           -           33.1 ± 0.6
Word2vec     29.0 ± 0.0  44.5 ± 0.5  54.3 ± 0.1  57.6 ± 0.6   20.7 ± 0.4  27.6 ± 0.3  29.8 ± 0.3  30.0 ± 0.4
This paper   51.1 ± 0.7  63.0 ± 0.1  64.7 ± 0.2  65.0 ± 0.3   38.6 ± 0.2  44.7 ± 0.2  44.6 ± 0.0  44.7 ± 0.3

2.3 Joint regression and classification

In hyperspherical prototype networks, classification and regression are optimized in the same manner based on a cosine similarity loss.
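The regression loss of Eq. 7 above can be sketched directly: map the target into [-1, 1] via the bounds and compare it to the cosine similarity with the upper-bound prototype. This is a plain-Python sketch with helper names of our own choosing.

```python
# Sketch of the regression loss of Eq. 7: the target y_i is mapped to
# r_i in [-1, 1] using the bounds v_l, v_u, and compared to the cosine
# similarity between the output z_i and the upper-bound prototype p_u.
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def regression_loss(outputs, targets, p_upper, v_l, v_u):
    """Sum of (r_i - cos(z_i, p_u))^2 with r_i = 2*(y_i - v_l)/(v_u - v_l) - 1."""
    loss = 0.0
    for z, y in zip(outputs, targets):
        r = 2.0 * (y - v_l) / (v_u - v_l) - 1.0
        loss += (r - cosine(z, p_upper)) ** 2
    return loss

# Upper-bound prototype on S^2 (hypothetical year bounds for illustration).
# An output orthogonal to p_u encodes the midpoint of the range
# (r = 0, cos = 0), giving zero loss.
p_u = [0.0, 0.0, 1.0]
assert abs(regression_loss([[1.0, 0.0, 0.0]], [1950.0], p_u, 1900.0, 2000.0)) < 1e-12
```

Any output direction with the correct angle to $p_u$ gives the same loss, which is exactly the extra freedom that distinguishes this formulation from one-dimensional regression.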
Thus, both tasks can be optimized not only with the same base network, as is common in multi-task learning [6], but even in the same output space. All that is required is to place the upper and lower polar bounds for regression in opposite directions along one axis. The other axes can then be used to maximally separate the class prototypes for classification. Optimization is as simple as summing the losses of Eq. 1 and 7. Unlike multi-task learning on standard losses for classification and regression, our approach requires no hyperparameters to balance the tasks, as the proposed losses are inherently in the same range and have identical behaviour. This allows us to solve multiple tasks at the same time in the same space without any task-specific tuning.

3 Experiments

Implementation details. For all our experiments, we use SGD, with a learning rate of 0.01, momentum of 0.9, weight decay of 1e-4, batch size of 128, no gradient clipping, and no pre-training. All networks are trained for 250 epochs, where after 100 and 200 epochs, the learning rate is reduced by one order of magnitude. For data augmentation, we perform random cropping and random horizontal flips. Everything is run with three random seeds and we report the average results with standard deviations. For the hyperspherical prototypes, we run gradient descent with the same settings for 1,000 epochs. The code and prototypes are available at: https://github.com/psmmettes/hpn.

3.1 Classification

Evaluating hyperspherical prototypes. We first evaluate the effect of hyperspherical prototypes with large margin separation on CIFAR-100 and ImageNet-200. CIFAR-100 consists of 60,000 images of size 32x32 from 100 classes. ImageNet-200 is a subset of ImageNet, consisting of 110,000 images of size 64x64 from 200 classes [18].
For both datasets, 10,000 examples are used for testing. ImageNet-200 provides a challenging and diverse classification task, while still being compact enough to enable broad experimentation across multiple network architectures, output dimensions, and hyperspherical prototypes. We compare to two baselines. The first consists of one-hot vectors on the C-dimensional simplex for C classes, as proposed in [2, 7]. The second baseline consists of word2vec vectors [27] for each class based on their name, which also use the cosine similarity to compare outputs, akin to our setting. For both baselines, we only alter the prototypes compared to our approach. The network architecture, loss function, and hyperparameters are identical. This experiment is performed for four different numbers of output dimensions.

The results with a ResNet-32 network [13] are shown in Table 1. For both CIFAR-100 and ImageNet-200, the hyperspherical prototypes obtain the highest scores when the output size is equal to the number of classes. The baseline with one-hot vectors cannot handle fewer output dimensions. Our approach can, and maintains accuracy when removing three quarters of the output space. For CIFAR-100, the hyperspherical prototypes perform 7.4 percentage points better than the baseline with word2vec prototypes. On ImageNet-200, the behavior is similar. When using even fewer output dimensions, the relative accuracy of our approach increases further. These results show that hyperspherical prototype networks can handle any output dimensionality and outperform prototype alternatives.
We have performed the same experiment with DenseNet-121 [16] in the supplementary materials, where we observe the same trends; we can trim up to 75 percent of the output space while maintaining accuracy, outperforming baseline prototypes.

Table 2: Separation stats.

             Separation ↑
             min    mean   max
One-hot      1.00   1.00   1.00
Word2vec     0.26   1.01   1.32
This paper   0.95   1.01   1.39

Table 3: Classification accuracy (%) on CIFAR-100 using ResNet-32 with and without privileged information in the prototype construction. Privileged information aids classification, especially for small outputs.

                                        CIFAR-100
Dimensions                  3           5           10          25          100         200
Hyperspherical prototypes   5.5 ± 0.3   28.7 ± 0.4  51.1 ± 0.7  63.0 ± 0.1  65.0 ± 0.3  63.7 ± 0.4
w/ privileged info          11.5 ± 0.4  37.0 ± 0.8  57.0 ± 0.6  64.0 ± 0.2  63.8 ± 0.1  64.7 ± 0.1

In Table 2, we have quantified prototype separation of the three approaches with 100 output dimensions on CIFAR-100. We calculate the min (cosine distance of closest pair), mean (average pair-wise cosine distance), and max (cosine distance of furthest pair) separation. Our approach obtains the highest mean and maximum separation, indicating the importance of pushing many classes beyond orthogonality. One-hot prototypes do not push beyond orthogonality, while word2vec prototypes have a low minimum separation, which induces confusion for semantically related classes. These limitations of the baselines are reflected in the classification results.

Prototypes with privileged information. Next, we investigate the effect of incorporating privileged information when obtaining hyperspherical prototypes for classification. We perform this experiment on CIFAR-100 with ResNet-32 using 3, 5, 10, and 25 output dimensions. The results in Table 3 show that incorporating privileged information in the prototype construction is beneficial for classification.
This holds especially when output spaces are small. When using only 5 output dimensions, we obtain an accuracy of 37.0 ± 0.8, compared to 28.7 ± 0.4, a considerable improvement. The same holds when the number of dimensions is larger than the number of classes. Separation optimization becomes more difficult, but privileged information alleviates this problem. We also find that the test convergence rate over training epochs is higher with privileged information. We highlight this in the supplementary materials. Privileged information results in a faster convergence, which we attribute to the semantic structure in the output space. We conclude that privileged information improves classification, especially when the number of output dimensions does not match the number of classes.

Comparison to other prototype networks. Third, we consider networks where prototypes are defined as the class means using the Euclidean distance, e.g. [11, 17, 37, 45]. We compare to Deep NCM of Guerriero et al. [11], since it can handle any number of training examples and any output dimensionality, akin to our approach. We follow [11] and report on CIFAR-100. We run the baseline provided by the authors with the same hyperparameter settings and network architecture as used in our approach. We report all their approaches for computing prototypes: mean condensation, mean decay, and online mean updates.

In Fig. 4, we provide the test accuracy as a function of the training epochs on CIFAR-100. Overall, our approach provides multiple benefits over Deep NCM [11]. First, the convergence of hyperspherical prototype networks is faster and reaches better results than the baselines. Second, the test accuracy of hyperspherical prototype networks changes more smoothly over iterations.
The test accuracy gradually improves over the training epochs and quickly converges, while the test accuracy of the baseline behaves more erratically between training epochs. Third, the optimization of hyperspherical prototype networks is computationally easier and more efficient. After a feed forward step through the network, each training example only needs to compute the cosine similarity with respect to their class prototypes. The baseline needs to compute a distance to all classes, followed by a softmax. Furthermore, the class prototypes require constant updating, while our prototypes remain fixed. Lastly, compared to other prototype-based networks, hyperspherical prototype networks are easier to implement and require only a few lines of code given pre-computed prototypes.

Figure 4: Comparison to [11]. Our approach outperforms the baseline across all their settings with the same network hyperparameters and architecture.

Comparison to softmax cross-entropy. Fourth, we compare to standard softmax cross-entropy classification. For fair comparison, we use the same number of output dimensions as classes for hyperspherical prototype networks, although we are not restricted to this setup, while softmax cross-entropy is. We report results in Table 4. First, when examples per class are large and evenly distributed, as on CIFAR-100, we obtain similar scores. In settings with few or uneven samples, our approach is preferred. To highlight this ability, we have altered the train and test class distribution on CIFAR-100, where we linearly increase the number of training examples for each class, from 2 up to 200.
For such a distribution, we outperform softmax cross-entropy. In our approach, all classes have a roughly equal portion of the output space, while this is not so for softmax cross-entropy in uneven settings [20]. We have also performed an experiment on CUB Birds 200-2011 [42], a dataset of 200 bird species, 5,994 training, and 5,794 test examples, i.e. a low number of examples per class. On this dataset, we perform better than softmax cross-entropy under identical networks and hyperparameters (47.3 ± 0.1 vs 43.0 ± 0.6). Lastly, we have compared our approach to a softmax cross-entropy baseline which learns a cosine similarity using all class prototypes. This baseline obtains an accuracy of 55.5 ± 0.2 on CIFAR-100, not competitive with standard softmax cross-entropy and our approach. We conclude that we are comparable to softmax cross-entropy for sufficient examples and preferred when examples per class are unevenly distributed or scarce.

Table 4: Accuracy (%) for our approach compared to softmax cross-entropy. When examples per class are scarce or uneven, our approach is preferred.

                  CIFAR-100                CUB-200
ex / class   500         2 to 200     ~30
Softmax CE   64.4 ± 0.4  44.2 ± 0.0   43.1 ± 0.6
This paper   65.0 ± 0.3  46.4 ± 0.0   47.3 ± 0.1

3.2 Regression

Next, we evaluate regression on hyperspheres. We do so on the task of predicting the creation year of paintings from the 20th century, as available in OmniArt [38]. This results in a dataset of 15,000 training and 8,353 test examples. We use ResNet-32, trained akin to the classification setup. Mean Absolute Error is used for evaluation. We compare to a squared loss regression baseline, where we normalize and 0-1 clamp the outputs using the upper and lower bounds for a fair comparison.
We create baseline variants where the output layer has more dimensions, with an additional layer to a real output to ensure at least as many parameters as our approach.

Table 5: Mean absolute error rates for creation year on artistic images in OmniArt. Our approach obtains the best results and is robust to learning rate settings.

                    S1                            S2
Learning rate   1e-2            1e-3          1e-2          1e-3
MSE             210.7 ± 140.1   110.3 ± 0.8   339.9 ± 0.0   109.9 ± 0.5
This paper      84.4 ± 10.7     76.3 ± 5.6    82.9 ± 1.9    73.2 ± 0.6

Table 5 shows the regression results of our approach compared to the baseline. We investigate both S1 and S2 as outputs. When using a learning rate of 1e-2, akin to classification, our approach obtains an MAE of 84.4 (S1) and 82.9 (S2). The baseline yields an error rate of respectively 210.7 and 339.9, which we found was due to exploding gradients. Therefore, we also employed a learning rate of 1e-3, resulting in an MAE of 76.3 (S1) and 73.2 (S2) for our approach, compared to 110.0 and 109.9 for the baseline. While the baseline improves considerably at this lower learning rate, our results also improve and remain ahead. We conclude that hyperspherical prototype networks are both robust and effective for regression.

3.3 Joint regression and classification

Rotated MNIST. For a qualitative analysis of the joint optimization we use MNIST. We classify the digits and regress on their rotation. We use the digits 2, 3, 4, 5, and 7 and apply a random rotation between 0 and 180 degrees to each example. The other digits are not of interest given the rotational range. We employ S2 as output, where the classes are separated along the (x, y)-plane and the rotations are projected along the z-axis. A simple network is used with two convolutional and two fully connected layers. Fig. 5 shows how in the same space, both image rotations and classes can be modeled.
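This shared output space can be sketched concretely. The construction below is our own illustrative reading of the description above (class prototypes on the equator of S^2, rotation regressed along the z-axis between the polar bounds); the names and the zero-loss example are ours.

```python
# Illustrative sketch (our own construction) of the joint output space for
# rotated MNIST: class prototypes on the equator of S^2 (the (x, y)-plane),
# rotation regressed along the z-axis, and the joint loss as the sum of
# the classification loss (Eq. 1) and the regression loss (Eq. 7).
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

K = 5                                    # digits 2, 3, 4, 5, 7
class_protos = [[math.cos(2 * math.pi * k / K),
                 math.sin(2 * math.pi * k / K), 0.0] for k in range(K)]
p_u = [0.0, 0.0, 1.0]                    # rotation upper bound (180 degrees)

def joint_loss(z, label, rotation):
    """Sum of Eq. 1 and Eq. 7 in the same output space (v_l = 0, v_u = 180)."""
    lc = (1.0 - cosine(z, class_protos[label])) ** 2
    r = 2.0 * rotation / 180.0 - 1.0
    lr = (r - cosine(z, p_u)) ** 2
    return lc + lr

# An output aligned with class 0's prototype and orthogonal to p_u encodes
# class 0 at a 90-degree rotation, with zero total loss.
assert abs(joint_loss([1.0, 0.0, 0.0], 0, 90.0)) < 1e-12
```

Because both terms are squared cosine-based losses in [0, 4], no task-weighting hyperparameter is needed when summing them.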
Along the z-axis, images are gradually rotated, while the (x, y)-plane is split into maximally separated slices representing the classes. This qualitative result shows both tasks can be modeled jointly in the same space.

Figure 5: Joint regression and classification on rotated MNIST. Left: colored by rotation (z-axis). Right: colored by class assignment (xy-plane).

Table 6: Joint creation year and art style prediction on OmniArt. We are preferred over the multi-task baseline regardless of any tuning of the task weight, highlighting the effectiveness and robustness of our approach.

Task weight               0.01    0.10    0.25    0.50    0.90
Creation year (MAE ↓)
  MTL baseline            262.7   344.5   348.5   354.7   352.3
  This paper              65.2    64.6    64.1    68.3    83.6
Art style (acc ↑)
  MTL baseline            49.5    44.6    47.9    47.1    47.2
  This paper              54.5    46.6    51.2    51.4    52.6

Predicting creation year and art style. Finally, we focus on jointly regressing the creation year and classifying the art style on OmniArt. There are in total 46 art styles, denoting the school of the artworks, e.g. the Dutch and French schools. We use a ResNet-32 with the same settings as above (learning rate is set to 1e-3). We compare to a multi-task baseline, which uses the same network and settings, but with squared loss for regression and softmax cross-entropy for classification. Since this baseline requires task weighting, we compare both across various relative weights between the regression and classification branches. The results are shown in Table 6. The weights listed in the table denote the weight assigned to the regression branch, with one minus the weight for the classification branch. We make two observations. First, we outperform the multi-task baseline across weight settings, highlighting our ability to learn multiple tasks simultaneously in the same shared space.
Second, we find that the creation year error is lower than reported in the regression experiment, indicating that additional information from art style benefits the creation year task. We conclude that hyperspherical prototype networks are effective for learning multiple tasks in the same space, with no need for hyperparameters to weight the individual tasks.

4 Related work

Our proposal relates to prototype-based networks, which have gained traction under names such as proxies [29], means [11], prototypical concepts [17], and prototypes [9, 19, 33, 36, 37]. In general, these works adhere to the Nearest Mean Classifier paradigm [26] by assigning training examples to a vector in the output space of the network, which is defined to be the mean vector of the training examples. A few works have also investigated multiple prototypes per class [1, 29, 46]. Prototype-based networks result in a simple output layout [45] and generalize quickly to new classes [11, 37, 46]. While promising, the training of these prototype networks faces a chicken-or-egg dilemma. Training examples are mapped to class prototypes, while class prototypes are defined as the mean of the training examples. Because the projection from input to output changes continuously during network training, the true location of the prototypes changes with each mini-batch update. Obtaining the true location of the prototypes is expensive, as it requires a pass over the complete dataset. As such, prototype networks either focus on the few-shot regime [4, 37], or on approximating the prototypes, e.g. by alternating the example mapping and prototype learning [12] or by updating the prototypes online as a function of the mini-batches [11]. We bypass the prototype learning altogether by structuring the output space prior to training. By defining prototypes as points on the hypersphere, we are able to separate them with large margins a priori.
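This data-independent placement and the resulting training objective can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: a simple greedy scheme that repeatedly pushes apart the most similar pair of prototypes stands in for the paper's gradient-based separation optimizer, and `place_prototypes` / `prototype_loss` are hypothetical helper names for illustration, not the authors' code.

```python
import numpy as np

def place_prototypes(num_classes, dims, steps=2000, lr=0.1, seed=0):
    """Spread class prototypes over the unit hypersphere before training,
    by repeatedly pushing apart the currently most similar pair."""
    rng = np.random.default_rng(seed)
    p = rng.normal(size=(num_classes, dims))
    p /= np.linalg.norm(p, axis=1, keepdims=True)
    for t in range(steps):
        sim = p @ p.T                    # pairwise cosine similarities
        np.fill_diagonal(sim, -2.0)      # mask self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        step = lr * (1.0 - t / steps)    # decaying step for a stable final layout
        p[i], p[j] = p[i] - step * p[j], p[j] - step * p[i]
        p /= np.linalg.norm(p, axis=1, keepdims=True)
    return p

def prototype_loss(outputs, labels, prototypes):
    """Training loss: mean cosine distance between (l2-normalized)
    network outputs and the fixed prototype of their class."""
    z = outputs / np.linalg.norm(outputs, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(z * prototypes[labels], axis=1)))

prototypes = place_prototypes(num_classes=5, dims=3)
labels = np.array([0, 1, 2])
# Outputs lying exactly on their class prototypes give (near) zero loss.
print(prototype_loss(prototypes[labels] * 3.0, labels, prototypes))
```

In a full network, `outputs` would be the final-layer activations and the prototypes would stay fixed throughout training; only the cosine distance to the assigned prototype is minimized.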
The network optimization simplifies to minimizing the cosine distance between training examples and their corresponding prototype, alleviating the need to continuously obtain and learn prototypes. We also generalize beyond classification to regression using the same optimization and loss function.

The work of Perrot and Habrard [34] relates to our approach since they also use pre-defined prototypes. They do so in Euclidean space for metric learning only, while we employ hyperspherical prototypes for classification and regression in deep networks. Bojanowski and Joulin [3] showed that unsupervised learning is possible through projections to random prototypes on the unit hypersphere and updating prototype assignments. We also investigate hyperspherical prototypes, but do so in a supervised setting, without the need to perform any prototype updating. In the process, we are encouraged by Hoffer et al. [15], who show the potential of fixed output spaces. Several works have investigated prior and semantic knowledge in hyperbolic spaces [8, 32, 40]. We show how to embed prior knowledge in hyperspherical spaces and use it for recognition tasks. Liu et al. [20] propose a large margin angular separation of class vectors through a regularization in a softmax-based deep network. We fix highly separated prototypes prior to learning, rather than steering them during training, while enabling the use of prototypes in regression.

Several works have investigated the merit of optimizing based on angles over distances in deep networks. Liu et al. [23], for example, improve the separation in softmax cross-entropy by increasing the angular margin between classes.
In similar fashion, several works project network outputs to the hypersphere for classification through ℓ2 normalization, which forces softmax cross-entropy to optimize for angular separation [12, 21, 22, 43, 44, 47]. Gidaris and Komodakis [10] show that using cosine similarity in the output helps generalization to new classes. The potential of angular similarities has also been investigated in other layers of deep networks [24, 25]. We also focus on angular separation in deep networks, but do so from the prototype perspective.

5 Conclusions

This paper proposes hyperspherical prototype networks for classification and regression. The key insight is that class prototypes should not be a function of the training examples, as is currently the default, because it creates a chicken-or-egg dilemma during training. Indeed, when network weights are altered for training examples to move towards class prototypes in the output space, the class prototype locations alter too. We propose to treat the output space as a hypersphere, which enables us to distribute prototypes with large margin separation, specified prior to learning and without the need for any training data. Due to the general nature of hyperspherical prototype networks, we introduce extensions to deal with privileged information about class semantics, continuous output values, and joint task optimization in one and the same output space. Empirically, we have learned that hyperspherical prototypes are effective, fast to train, and easy to implement, resulting in flexible deep networks that handle classification and regression tasks in compact output spaces.
Potential future work for hyperspherical prototype networks includes incremental learning and open set learning, where the number of classes in the vocabulary is not fixed, requiring iterative updates of prototype locations.

Acknowledgements

The authors thank Gjorgji Strezoski and Thomas Mensink for help with datasets and experimentation.

References

[1] Kelsey R Allen, Evan Shelhamer, Hanul Shin, and Joshua B Tenenbaum. Infinite mixture prototypes for few-shot learning. arXiv, 2019.

[2] Björn Barz and Joachim Denzler. Deep learning on small datasets without pre-training using cosine loss. arXiv, 2019.

[3] Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In ICML, 2017.

[4] Rinu Boney and Alexander Ilin. Semi-supervised few-shot learning with prototypical networks. arXiv, 2017.

[5] Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N Hullender. Learning to rank using gradient descent. In ICML, 2005.

[6] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[7] Soumith Chintala, Marc'Aurelio Ranzato, Arthur Szlam, Yuandong Tian, Mark Tygert, and Wojciech Zaremba. Scale-invariant learning and convolutional networks. Applied and Computational Harmonic Analysis, 42(1):154–166, 2017.

[8] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In NeurIPS, 2018.

[9] Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In AAAI, 2019.

[10] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.

[11] Samantha Guerriero, Barbara Caputo, and Thomas Mensink.
Deep nearest class mean classifiers. In ICLR, Workshop Track, 2018.

[12] Abul Hasnat, Julien Bohné, Jonathan Milgram, Stéphane Gentric, and Liming Chen. Von Mises-Fisher mixture model-based deep learning: Application to face verification. arXiv, 2017.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[14] J S Hicks and R F Wheeling. An efficient method for generating uniformly distributed points on the surface of an n-dimensional sphere. Communications of the ACM, 2(4):17–19, 1959.

[15] Elad Hoffer, Itay Hubara, and Daniel Soudry. Fix your classifier: the marginal value of training the last weight layer. In ICLR, 2018.

[16] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.

[17] Saumya Jetley, Bernardino Romera-Paredes, Sadeep Jayasumana, and Philip Torr. Prototypical priors: From improving classification to zero-shot learning. In BMVC, 2015.

[18] Fei-Fei Li, Andrej Karpathy, and Justin Johnson. https://tiny-imagenet.herokuapp.com.

[19] Xiao Li, Min Fang, Dazheng Feng, Haikun Li, and Jinqiao Wu. Prototype adjustment for zero shot classification. Signal Processing: Image Communication, 74:242–252, 2019.

[20] Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. In NeurIPS, 2018.

[21] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M Rehg, and Le Song. Decoupled networks. In CVPR, 2018.

[22] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.

[23] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks.
In ICML, 2016.

[24] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. In NeurIPS, 2017.

[25] Chunjie Luo, Jianfeng Zhan, Lei Wang, and Qiang Yang. Cosine normalization: Using cosine similarity instead of dot product in neural networks. arXiv, 2017.

[26] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2624–2637, 2013.

[27] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS, 2013.

[28] Tom M Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers University, New Jersey, 1980.

[29] Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In ICCV, 2017.

[30] Mervin E Muller. A note on a method for generating points uniformly on n-dimensional spheres. Communications of the ACM, 2(4):19–20, 1959.

[31] Oleg R Musin and Alexey S Tarasov. The Tammes problem for n=14. Experimental Mathematics, 24(4):460–468, 2015.

[32] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In NeurIPS, 2017.

[33] Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, and Tao Mei. Transferrable prototypical networks for unsupervised domain adaptation. In CVPR, 2019.

[34] Michaël Perrot and Amaury Habrard. Regressive virtual metric learning. In NeurIPS, 2015.

[35] Edward Saff and Arno Kuijlaars. Distributing many points on a sphere.
The Mathematical Intelligencer, 19(1), 1997.

[36] Harshita Seth, Pulkit Kumar, and Muktabh Mayank Srivastava. Prototypical metric transfer learning for continuous speech keyword spotting with limited training data. arXiv, 2019.

[37] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.

[38] Gjorgji Strezoski and Marcel Worring. OmniArt: Multi-task deep learning for artistic data analysis. arXiv, 2017.

[39] Pieter Tammes. On the origin of number and arrangement of the places of exit on the surface of pollen-grains. Recueil des travaux botaniques néerlandais, 27(1):1–84, 1930.

[40] Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. Poincaré GloVe: Hyperbolic word embeddings. In ICLR, 2019.

[41] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: Similarity control and knowledge transfer. Journal of Machine Learning Research, 16:2023–2049, 2015.

[42] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, 2011.

[43] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In MM, 2017.

[44] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Zhifeng Li, Dihong Gong, Jingchao Zhou, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In CVPR, 2018.

[45] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.

[46] Hong-Ming Yang, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Robust classification with convolutional prototype learning. In CVPR, 2018.

[47] Yutong Zheng, Dipan K Pal, and Marios Savvides. Ring loss: Convex feature normalization for face recognition.
In CVPR, 2018.