{"title": "Loss Functions for Multiset Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 5783, "page_last": 5792, "abstract": "We study the problem of multiset prediction. The goal of multiset prediction is to train a predictor that maps an input to a multiset consisting of multiple items. Unlike existing problems in supervised learning, such as classification, ranking and sequence generation, there is no known order among items in a target multiset, and each item in the multiset may appear more than once, making this problem extremely challenging. In this paper, we propose a novel multiset loss function by viewing this problem from the perspective of sequential decision making. The proposed multiset loss function is empirically evaluated on two families of datasets, one synthetic and the other real, with varying levels of difficulty, against various baseline loss functions including reinforcement learning, sequence, and aggregated distribution matching loss functions. The experiments reveal the effectiveness of the proposed loss function over the others.", "full_text": "Loss Functions for Multiset Prediction\n\nSean Welleck1,2, Zixin Yao1, Yu Gai1, Jialin Mao1, Zheng Zhang1, Kyunghyun Cho2,3\n\n1New York University Shanghai\n\n{wellecks,zixin.yao,yg1246,jialin.mao,zz,kyunghyun.cho}@nyu.edu\n\n2New York University\n\n3CIFAR Azrieli Global Scholar\n\nAbstract\n\nWe study the problem of multiset prediction. The goal of multiset prediction is\nto train a predictor that maps an input to a multiset consisting of multiple items.\nUnlike existing problems in supervised learning, such as classi\ufb01cation, ranking\nand sequence generation, there is no known order among items in a target multiset,\nand each item in the multiset may appear more than once, making this problem\nextremely challenging. In this paper, we propose a novel multiset loss function\nby viewing this problem from the perspective of sequential decision making. 
The proposed multiset loss function is empirically evaluated on two families of datasets, one synthetic and the other real, with varying levels of difficulty, against various baseline loss functions including reinforcement learning, sequence, and aggregated distribution matching loss functions. The experiments reveal the effectiveness of the proposed loss function over the others.\n\n1 Introduction\n\nA relatively understudied problem in machine learning, particularly in supervised learning, is multiset prediction: learning a mapping from an arbitrary input to a multiset1 of items. This problem appears in a variety of contexts. For instance, in high-energy physics, an important problem in particle physics data analysis is to count how many physics objects, such as electrons, muons, photons, taus, and jets, are in a collision event [5]. In computer vision, object counting and automatic alt-text can be framed as multiset prediction [25, 12].\nIn multiset prediction, a learner is presented with an arbitrary input and the associated multiset of items. It is assumed that there is no predefined order among the items, and that there are no further annotations containing information about the relationship between the input and each of the items in the multiset. These properties make multiset prediction distinct from other well-studied problems. It is different from sequence prediction, because there is no known order among the items. It is not a ranking problem, since each item may appear more than once. It cannot be transformed into classification, because the number of possible multisets grows exponentially with the maximum multiset size.\nIn this paper, we view multiset prediction as a sequential decision making process. 
Under this view, the problem reduces to finding a policy that sequentially predicts one item at a time, while the outcome is still evaluated based on the aggregate multiset of the predicted items. We first propose an oracle policy that assigns non-zero probabilities only to prediction sequences that result exactly in the target, ground-truth multiset given an input. This oracle is optimal in the sense that its prediction never decreases the precision and recall regardless of previous predictions. That is, its decision is optimal in any state (i.e., for any prediction prefix). We then propose a novel multiset loss which minimizes the KL divergence between the oracle policy and a parametrized policy at every point in a decision trajectory of the parametrized policy.\n\n1A set that allows multiple instances, e.g. {x, y, x}. See Appendix A for a detailed definition.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWe compare the proposed multiset loss against an extensive set of baselines. They include a sequential loss with an arbitrary rank function, a sequential loss with an input-dependent rank function, and an aggregated distribution matching loss and its one-step variant. We also test policy gradient, as was done recently in [25] for multiset prediction. Our evaluation is conducted on two sets of datasets with varying difficulties and properties. According to the experiments, we find that the proposed multiset loss outperforms all the other loss functions.\n\n2 Multiset Prediction\n\nA multiset prediction problem is a generalization of classification, where a target is not a single class but a multiset of classes. The goal is to find a mapping from an input x to a multiset Y = {y1, . . . , y|Y|}, where yk \u2208 C. Some of the core properties of multiset prediction are: (1) the input x is an arbitrary vector, (2) there is no predefined order among the items yi in the target multiset Y, (3) the size of Y may vary depending on the input x, and (4) each item in the class set C may appear more than once in Y. Formally, Y is a multiset Y = (\u00b5, C), where \u00b5 : C \u2192 N gives the number of occurrences of each class c \u2208 C in the multiset. See Appendix A for a further review of multisets.\nAs is typical in supervised learning, in multiset prediction a model f\u03b8(x) is trained on a dataset {(xi, Yi)}_{i=1}^N, then evaluated on a separate test set {(xi, Yi)}_{i=1}^n using evaluation metrics m(\u00b7, \u00b7) that compare the predicted and target multisets, i.e. (1/n) \u2211_{i=1}^n m(\u02c6Yi, Yi), where \u02c6Yi = f\u03b8(xi) denotes a predicted multiset. For evaluation metrics we use exact match EM(\u02c6Y, Y) = I[\u02c6Y = Y], and the F1 score. Refer to Appendix A for multiset definitions of exact match and F1.\n\n3 Related Problems in Supervised Learning\n\nVariants of multiset prediction have been studied earlier. We now discuss a taxonomy of approaches in order to differentiate our proposal from previous work and define strong baselines.\n\n3.1 Set Prediction\n\nRanking A ranking problem can be considered as learning a mapping from a pair of an input x and one of the items c \u2208 C to its score s(x, c). All the items in the class set are then sorted according to this score, and the sorted order determines the rank of each item. Taking the top-K items from the sorted list results in a predicted set (e.g. [6]). Similarly to multiset prediction, the input x is arbitrary, and the target is a set without any prespecified order. 
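The exact match and F1 metrics of Section 2 can be sketched concretely by representing a multiset as a count function (a Python `Counter`). This is a minimal illustration, not the paper's code; it assumes the standard multiset F1, where the overlap counts each class `min(predicted count, target count)` times, which the Appendix A definitions are taken to match.

```python
from collections import Counter

def exact_match(pred, target):
    """EM = 1 iff the predicted and target multisets are identical."""
    return int(Counter(pred) == Counter(target))

def multiset_f1(pred, target):
    """F1 over multisets: overlap counts each class min(#pred, #target) times."""
    p, t = Counter(pred), Counter(target)
    overlap = sum((p & t).values())  # & is multiset intersection (min counts)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match(['a', 'b', 'a'], ['a', 'a', 'b'])` is 1, since order is irrelevant, while a prediction that drops one copy of a repeated class lowers recall but not precision.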
However, ranking differs from multiset\nprediction in that it is unable to handle multiple occurrences of a single item in the target set.\nMulti-label Classi\ufb01cation via Binary Classi\ufb01cation Multi-label classi\ufb01cation consists of learning\na mapping from an input x to a subset of classes identi\ufb01ed as y \u2208 {0, 1}|C|. This problem can\nbe reduced to |C| binary classi\ufb01cation problems by learning a binary classi\ufb01er for each possible\n(cid:81)|C|\nclass. Representative approaches include binary relevance, which assumes classes are conditionally\nindependent, and probabilistic classi\ufb01er chains which decompose the joint probability as p(y|x) =\nc=1 p(yc|y H(\u03c0(t+1)\n\n\u2217\n\n),\n\n4\n\n\fProofs of the remarks above can be found in Appendix B\u2013D.\nThe oracle\u2019s optimality allows us to consider a step-wise loss between a parametrized policy \u03c0\u03b8 and\nthe oracle policy \u03c0\u2217, because the oracle policy provides us with an optimal decision regardless of\nthe quality of the pre\ufb01x generated so far. We thus propose to minimize the KL divergence from the\noracle policy to the parametrized policy at each step separately. This divergence is de\ufb01ned as\n\nKL(\u03c0t\u2217(cid:107)\u03c0t\n\n(cid:124) (cid:123)(cid:122) (cid:125)\n\u03b8) = H(\u03c0t\u2217)\n\nconst. w.r.t. \u03b8\n\n\u2212 (cid:88)\n\nyj\u2208|Yt|\n\n1\n\n|Yt| log \u03c0\u03b8(yj|\u02c6y H(\u03c0(t+1)\n). This naturally follows from the fact\nthat there is no pre-speci\ufb01ed rank function, because the oracle\npolicy cannot prefer any item from the others in a free label\nmultiset. Hence, we examine here how the policy learned\nbased on each loss function compares to the oracle policy in\nterms of per-step entropy. We consider the policies trained\non MNIST Multi (10), where the differences among them\nwere most clear. As shown in Fig. 
1, the policy trained on MNIST Multi (10) using the proposed multiset loss closely follows the oracle policy. The entropy decreases as predictions are made, which can be interpreted as concentrating probability mass on progressively smaller free label sets. The variance is quite small, indicating that this strategy is applied uniformly across inputs.\nThe policy trained with reinforcement learning retains a relatively low entropy across steps, with a decreasing trend in the second half. We suspect the low entropy in the earlier steps is due to the greedy nature of policy gradient: the policy receives a high reward more easily by choosing one of many possible choices in an earlier step than in a later step, which effectively discourages it from exploring all possible trajectories during training.\nOn the other hand, the policy found by aggregated distribution matching (L^KL_dm) has the opposite behaviour: its entropy in general grows as more predictions are made. To see why this is sub-optimal, consider the final step. Assuming the first nine predictions were correct, there is only one correct class left for the final prediction. The high entropy, however, indicates that the model is placing a significant amount of probability on incorrect sequences. Such a policy may result because L^KL_dm cannot properly distinguish between policies with increasing and decreasing entropies. The increasing entropy also indicates that the policy has implicitly learned a rank function and is fully relying on it. We conjecture that this reliance on an inferred rank function, which is by definition sub-optimal, resulted in the lower performance of aggregated distribution matching.\n\n6 Conclusion\n\nWe have extensively investigated the problem of multiset prediction in this paper. 
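As a recap of the approach, the oracle policy and the per-step objective can be sketched as follows. This is a hedged sketch, not the authors' implementation: the oracle is uniform over the remaining free labels, the per-step KL from the oracle to the model reduces (up to the constant oracle entropy) to a cross-entropy, and the greedy update of the free label multiset plus averaging over steps are assumptions of this sketch (the names `oracle_distribution` and `multiset_loss` are illustrative).

```python
from collections import Counter
import math

def oracle_distribution(free_labels, classes):
    """Uniform distribution over the remaining (free) labels, zero elsewhere."""
    total = sum(free_labels.values())
    return {c: free_labels[c] / total for c in classes}

def multiset_loss(target, predicted_probs, classes):
    """Average per-step cross-entropy between the oracle and the model,
    i.e. KL(oracle || model) up to the oracle-entropy constant.
    `predicted_probs` holds one model distribution per prediction step
    (at most len(target) steps, so the free multiset never empties early)."""
    free = Counter(target)
    loss = 0.0
    for probs in predicted_probs:
        oracle = oracle_distribution(free, classes)
        loss += -sum(oracle[c] * math.log(probs[c])
                     for c in classes if oracle[c] > 0)
        # greedily remove the predicted item from the free label multiset
        pred = max(probs, key=probs.get)
        if free[pred] > 0:
            free[pred] -= 1
    return loss / len(predicted_probs)
```

Note how the free label multiset shrinks step by step, mirroring the decreasing oracle entropy discussed above.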
We rigorously\nde\ufb01ned the problem, and proposed to approach it from the perspective of sequential decision making.\nIn doing so, an oracle policy was de\ufb01ned and shown to be optimal, and a new loss function, called\nmultiset loss, was introduced as a means to train a parametrized policy for multiset prediction. The\nexperiments on two families of datasets, MNIST Multi variants and MS COCO variants, have revealed\nthe effectiveness of the proposed loss function over other loss functions including reinforcement\nlearning, sequence, and aggregated distribution matching loss functions. This success brings in new\nopportunities of applying machine learning to various new domains, including high-energy physics.\n\nAcknowledgments\n\nKC thanks support by eBay, TenCent, NVIDIA and CIFAR. This work was supported by Samsung\nElectronics (Improving Deep Learning using Latent Structure) and 17JC1404101 STCSM.\n\n8\n\n\fReferences\n\n[1] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daum\u00e9, III, and John Langford.\nLearning to search better than your teacher. In Proceedings of the 32Nd International Conference\non International Conference on Machine Learning - Volume 37, ICML\u201915, pages 2058\u20132066.\nJMLR.org, 2015.\n\n[2] Hal Daum\u00e9, John Langford, and Daniel Marcu. Search-based structured prediction. Machine\n\nLearning, 75(3):297\u2013325, Jun 2009.\n\n[3] Krzysztof Dembczy\u00b4nski, Weiwei Cheng, and Eyke H\u00fcllermeier. Bayes optimal multilabel\nIn Proceedings of the 27th International\nclassi\ufb01cation via probabilistic classi\ufb01er chains.\nConference on International Conference on Machine Learning, ICML\u201910, pages 279\u2013286, USA,\n2010. Omnipress.\n\n[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale\nhierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.\nIEEE Conference on, pages 248\u2013255. 
IEEE, 2009.\n\n[5] W Ehrenfeld, R Buckingham, J Cranshaw, T Cuhadar Donszelmann, T Doherty, E Gallas,\nJ Hrivnac, D Malon, M Nowak, M Slater, F Viegas, E Vinek, Q Zhang, and the ATLAS Col-\nlaboration. Using tags to speed up the atlas analysis process. Journal of Physics: Conference\nSeries, 331(3):032007, 2011.\n\n[6] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and Sergey Ioffe. Deep\nconvolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894, 2013.\n\n[7] S. Hamid Rezato\ufb01ghi, Vijay Kumar B G, Anton Milan, Ehsan Abbasnejad, Anthony Dick, and\nIan Reid. Deepsetnet: Predicting sets with deep neural networks. In The IEEE International\nConference on Computer Vision (ICCV), Oct 2017.\n\n[8] Kaiming He, Georgia Gkioxari, Piotr Doll\u00e1r, and Ross Girshick. Mask R-CNN. arXiv preprint\n\narXiv:1703.06870, 2017.\n\n[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[10] R\u00e9mi Leblond, Jean-Baptiste Alayrac, Anton Osokin, and Simon Lacoste-Julien. Searnn:\n\nTraining rnns with global-local losses, 2017.\n\n[11] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning\n\napplied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[12] V. Lempitsky and A. Zisserman. Learning to count objects in images. In Advances in Neural\n\nInformation Processing Systems, 2010.\n\n[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr\nDoll\u00e1r, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European\nconference on computer vision, pages 740\u2013755. Springer, 2014.\n\n[14] Jinseok Nam, Eneldo Loza Menc\u00eda, Hyunwoo J Kim, and Johannes F\u00fcrnkranz. 
Maximizing\nsubset accuracy with recurrent neural networks in multi-label classi\ufb01cation. In I. Guyon, U. V.\nLuxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances\nin Neural Information Processing Systems 30, pages 5419\u20135429. Curran Associates, Inc., 2017.\n\n[15] Daniel O\u00f1oro-Rubio and Roberto J. L\u00f3pez-Sastre. Towards perspective-free object counting\nwith deep learning. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer\nVision \u2013 ECCV 2016, pages 615\u2013629, Cham, 2016. Springer International Publishing.\n\n[16] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural\n\nNetworks, 21(4):682\u2013697, May 2008.\n\n9\n\n\f[17] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classi\ufb01er chains for multi-\n\nlabel classi\ufb01cation. Machine Learning, 85(3):333, Jun 2011.\n\n[18] Mengye Ren and Richard S. Zemel. End-to-end instance segmentation with recurrent attention.\n\nIn The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.\n\n[19] Bernardino Romera-Paredes and Philip H. S. Torr. Recurrent instance segmentation. 2015.\n\n[20] St\u00e9phane Ross, Geoffrey J Gordon, and Drew Bagnell. A reduction of imitation learning and\nstructured prediction to no-regret online learning. In International Conference on Arti\ufb01cial\nIntelligence and Statistics, pages 627\u2013635, 2011.\n\n[21] Russell Stewart, Mykhaylo Andriluka, and Andrew Y. Ng. End-to-end people detection in\ncrowded scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\nJune 2016.\n\n[22] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classi\ufb01cation: An overview. Int J Data\n\nWarehousing and Mining, 2007:1\u201313, 2007.\n\n[23] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 
Order matters: Sequence to sequence for\n\nsets, 2015.\n\n[24] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. Cnn-rnn: A\nuni\ufb01ed framework for multi-label image classi\ufb01cation. In The IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), June 2016.\n\n[25] Sean Welleck, Kyunghyun Cho, and Zheng Zhang. Saliency-based sequential image attention\n\nwith multiset prediction. In Advances in neural information processing systems, 2017.\n\n[26] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce-\n\nment learning. Machine learning, 8(3-4):229\u2013256, 1992.\n\n[27] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun\nWoo. Convolutional lstm network: A machine learning approach for precipitation nowcasting.\nIn Advances in neural information processing systems, pages 802\u2013810, 2015.\n\n[28] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column\nconvolutional neural network. In 2016 IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), pages 589\u2013597, June 2016.\n\n10\n\n\f", "award": [], "sourceid": 2786, "authors": [{"given_name": "Sean", "family_name": "Welleck", "institution": "NYU"}, {"given_name": "Zixin", "family_name": "Yao", "institution": "New York University"}, {"given_name": "Yu", "family_name": "Gai", "institution": "New York University"}, {"given_name": "Jialin", "family_name": "Mao", "institution": "New York University"}, {"given_name": "Zheng", "family_name": "Zhang", "institution": "Shanghai New York Univeristy"}, {"given_name": "Kyunghyun", "family_name": "Cho", "institution": "NYU"}]}