{"title": "Deep Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 3391, "page_last": 3401, "abstract": "We study the problem of designing models for machine learning tasks defined on sets. In contrast to the traditional approach of operating on fixed dimensional vectors, we consider objective functions defined on sets and are invariant to permutations. Such problems are widespread, ranging from the estimation of population statistics, to anomaly detection in piezometer data of embankment dams, to cosmology. Our main theorem characterizes the permutation invariant objective functions and provides a family of functions to which any permutation invariant objective function must belong. This family of functions has a special structure which enables us to design a deep network architecture that can operate on sets and which can be deployed on a variety of scenarios including both unsupervised and supervised learning tasks. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and outlier detection.", "full_text": "Deep Sets\n\nManzil Zaheer1,2, Satwik Kottur1, Siamak Ravanbhakhsh1,\n\nBarnab\u00e1s P\u00f3czos1, Ruslan Salakhutdinov1, Alexander J Smola1,2\n\n{manzilz,skottur,mravanba,bapoczos,rsalakhu,smola}@cs.cmu.edu\n\n1 Carnegie Mellon University\n\n2 Amazon Web Services\n\nAbstract\n\nWe study the problem of designing models for machine learning tasks de\ufb01ned on\nsets. In contrast to traditional approach of operating on \ufb01xed dimensional vectors,\nwe consider objective functions de\ufb01ned on sets that are invariant to permutations.\nSuch problems are widespread, ranging from estimation of population statistics [1],\nto anomaly detection in piezometer data of embankment dams [2], to cosmology [3,\n4]. 
Our main theorem characterizes the permutation invariant functions and provides a family of functions to which any permutation invariant objective function must belong. This family of functions has a special structure which enables us to design a deep network architecture that can operate on sets and which can be deployed on a variety of scenarios including both unsupervised and supervised learning tasks. We also derive the necessary and sufficient conditions for permutation equivariance in deep models. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and outlier detection.

1 Introduction

A typical machine learning algorithm, like regression or classification, is designed for fixed dimensional data instances. Extending such algorithms to handle inputs or outputs that are permutation invariant sets, rather than fixed dimensional vectors, is not trivial, and researchers have only recently started to investigate it [5-8]. In this paper, we present a generic framework to deal with the setting where input and possibly output instances in a machine learning task are sets.

Similar to fixed dimensional data instances, we can characterize two learning paradigms in the case of sets. In supervised learning, we have an output label for a set that is invariant or equivariant to the permutation of set elements. Examples include tasks like estimation of population statistics [1], where applications range from giga-scale cosmology [3, 4] to nano-scale quantum chemistry [9].

Next, there can be the unsupervised setting, where the "set" structure needs to be learned, e.g. by leveraging the homophily/heterophily tendencies within sets. An example is the task of set expansion (a.k.a. audience expansion), where, given a set of objects that are similar to each other (e.g. 
set of words {lion, tiger, leopard}), our goal is to find new objects from a large pool of candidates such that the selected new objects are similar to the query set (e.g. find words like jaguar or cheetah among all English words). This is a standard problem in similarity search and metric learning, and a typical application is to find new image tags given a small set of possible tags. Likewise, in the field of computational advertisement, given a set of high-value customers, the goal would be to find similar people. This is an important problem in many scientific applications, e.g. given a small set of interesting celestial objects, astrophysicists might want to find similar ones in large sky surveys.

Main contributions. In this paper, (i) we propose a fundamental architecture, DeepSets, to deal with sets as inputs and show that the properties of this architecture are both necessary and sufficient (Sec. 2). (ii) We extend this architecture to allow for conditioning on arbitrary objects, and (iii) based on this architecture we develop a deep network that can operate on sets with possibly different sizes (Sec. 3). We show that a simple parameter-sharing scheme enables a general treatment of sets within supervised and semi-supervised settings. (iv) Finally, we demonstrate the wide applicability of our framework through experiments on diverse problems (Sec. 4).

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Permutation Invariance and Equivariance

2.1 Problem Definition

A function f transforms its domain X into its range Y. Usually, the input domain is a vector space R^d and the output response range is either a discrete space, e.g. {0, 1} in case of classification, or a continuous space R in case of regression. Now, if the input is a set X = {x1, . . . 
, xM}, xm ∈ X, i.e., the input domain is the power set X = 2^X, then we would like the response of the function to be "indifferent" to the ordering of the elements. In other words,

Property 1 A function f : 2^X → Y acting on sets must be permutation invariant to the order of objects in the set, i.e. for any permutation π: f({x1, . . . , xM}) = f({xπ(1), . . . , xπ(M)}).

In the supervised setting, given N examples X^(1), . . . , X^(N) as well as their labels y^(1), . . . , y^(N), the task is to classify/regress (with a variable number of predictors) while being permutation invariant w.r.t. the predictors. In the unsupervised setting, the task is to assign high scores to valid sets and low scores to improbable sets. These scores can then be used for set expansion tasks, such as image tagging or audience expansion in the field of computational advertisement. In the transductive setting, each instance x_m^(n) has an associated label y_m^(n). The objective is instead to learn a permutation equivariant function f : X^M → Y^M that upon permutation of the input instances permutes the output labels, i.e. for any permutation π:

f([xπ(1), . . . , xπ(M)]) = [fπ(1)(x), . . . , fπ(M)(x)]   (1)

2.2 Structure

We want to study the structure of functions on sets. Their study in full generality is extremely difficult, so we analyze them case by case. We begin with the invariant case, when X is a countable set and Y = R, where the next theorem characterizes its structure.

Theorem 2 A function f(X) operating on a set X having elements from a countable universe is a valid set function, i.e., invariant to the permutation of instances in X, iff it can be decomposed in the form ρ(Σ_{x∈X} φ(x)), for suitable transformations φ and ρ.

For the case when X is uncountable, like X = R, we could only prove that f(X) = ρ(Σ_{x∈X} φ(x)) holds for sets of fixed size. The proofs, and the difficulties in handling the uncountable case, are discussed in Appendix A. However, we still conjecture that exact equality holds in general.

Next, we analyze the equivariant case, when X = Y = R and f is restricted to be a neural network layer. The standard neural network layer is represented as fΘ(x) = σ(Θx), where Θ ∈ R^{M×M} is the weight matrix and σ : R → R is a nonlinearity such as the sigmoid function. The following lemma states the necessary and sufficient conditions for permutation equivariance in this type of function.

Lemma 3 The function fΘ : R^M → R^M defined above is permutation equivariant iff all the off-diagonal elements of Θ are tied together and all the diagonal elements are equal as well. That is,

Θ = λI + γ(11^T),   λ, γ ∈ R,   1 = [1, . . . , 1]^T ∈ R^M,

where I ∈ R^{M×M} is the identity matrix. This result can be easily extended to higher dimensions, i.e., X = R^d, in which case λ, γ can be matrices.

2.3 Related Results

The general form of Theorem 2 is closely related to important results in different domains. Here, we quickly review some of these connections.

de Finetti theorem. A related concept is that of an exchangeable model in Bayesian statistics. It is backed by de Finetti's theorem, which states that any exchangeable model can be factored as

p(X|α, M0) = ∫ dθ [ ∏_{m=1}^{M} p(xm|θ) ] p(θ|α, M0),   (2)

where θ is some latent feature and α, M0 are the hyper-parameters of the prior. To see that this fits into our result, let us consider exponential families with conjugate priors, where we can analytically calculate the integral of (2). 
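The two structural results above, Theorem 2's sum decomposition and Lemma 3's tied-weight layer, are easy to check numerically. A minimal sketch follows; the particular φ, ρ, set values, and layer size are illustrative choices of ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invariant model (Theorem 2): f(X) = rho(sum_x phi(x)).
phi = lambda x: np.tanh(x * np.arange(1, 4))   # toy per-element embedding in R^3
rho = lambda s: float(np.sum(s ** 2))          # toy readout R^3 -> R
f = lambda X: rho(sum(phi(x) for x in X))

X = [0.3, -1.2, 2.5, 0.7]
assert np.isclose(f(X), f(X[::-1]))            # order of elements does not matter

# Equivariant layer (Lemma 3): Theta = lambda*I + gamma*(1 1^T).
M, lam, gam = 5, 0.7, -0.2
Theta = lam * np.eye(M) + gam * np.ones((M, M))
x = rng.normal(size=M)
perm = rng.permutation(M)
# Permuting the input permutes the output identically: f(pi x) = pi f(x).
assert np.allclose(np.tanh(Theta @ x[perm]), np.tanh(Theta @ x)[perm])
```

The second assertion holds because Θ of this form commutes with every permutation matrix, which is exactly the content of Lemma 3.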
In this special case p(x|θ) = exp(⟨φ(x), θ⟩ − g(θ)) and p(θ|α, M0) = exp(⟨θ, α⟩ − M0 g(θ) − h(α, M0)). Now if we marginalize out θ, we get a form which looks exactly like the one in Theorem 2:

p(X|α, M0) = exp( h(α + Σ_m φ(xm), M0 + M) − h(α, M0) ).   (3)

Representer theorem and kernel machines. Support distribution machines use f(p) = Σ_i αi yi K(pi, p) + b as the prediction function [8, 10], where pi, p are distributions and αi, b ∈ R. In practice, the pi, p distributions are never given to us explicitly; usually only i.i.d. sample sets are available from these distributions, and therefore we need to estimate the kernel K(p, q) using these samples. A popular approach is to use K̂(p, q) = (1/MM′) Σ_{i,j} k(xi, yj), where k is another kernel operating on the samples {xi}_{i=1}^M ∼ p and {yj}_{j=1}^{M′} ∼ q. Now, these prediction functions can be seen as fitting into the structure of our theorem.

Spectral methods. A consequence of the polynomial decomposition is that spectral methods [11] can be viewed as a special case of the mapping ρ ∘ φ(X): in that case one can compute polynomials, usually only up to a relatively low degree (such as k = 3), to perform inference about statistical properties of the distribution. The statistics are exchangeable in the data, hence they could be represented by the above map.

3 Deep Sets

3.1 Architecture

Invariant model. The structure of permutation invariant functions in Theorem 2 hints at a general strategy for inference over sets of objects, which we call DeepSets. Replacing φ and ρ by universal approximators leaves matters unchanged, since, in particular, φ and ρ can be used to approximate arbitrary polynomials. 
It then remains to learn these approximators, yielding the following model:
• Each instance xm is transformed (possibly by several layers) into some representation φ(xm).
• The representations φ(xm) are added up and the output is processed using the ρ network in the same manner as in any deep network (e.g. fully connected layers, nonlinearities, etc.).
• Optionally: If we have additional meta-information z, then the above mentioned networks could be conditioned to obtain the conditioning mapping φ(xm|z).
In other words, the key is to add up all representations and then apply nonlinear transformations.

Equivariant model. Our goal is to design neural network layers that are equivariant to the permutations of elements in the input x. Based on Lemma 3, a neural network layer fΘ(x) is permutation equivariant if and only if all the off-diagonal elements of Θ are tied together and all the diagonal elements are equal as well, i.e., Θ = λI + γ(11^T) for λ, γ ∈ R. This function is simply a nonlinearity applied to a weighted combination of (i) its input Ix and (ii) the sum of the input values (11^T)x. Since summation does not depend on the permutation, the layer is permutation equivariant. We can further manipulate the operations and parameters in this layer to get other variations, e.g.:

f(x) := σ(λIx + γ maxpool(x)1),   (4)

where the maxpooling operation over elements of the set (like the sum) is commutative. In practice, this variation performs better in some applications. This may be due to the fact that for λ = γ, the input to the nonlinearity is max-normalized. Since the composition of permutation equivariant functions is also permutation equivariant, we can build DeepSets by stacking such layers.

3.2 Other Related Works

Several recent works study equivariance and invariance in deep networks w.r.t. 
a general group of transformations [12-14]. For example, [15] construct deep permutation invariant features by pairwise coupling of features at the previous layer, where fi,j([xi, xj]) := [|xi − xj|, xi + xj] is invariant to transposition of i and j. Pairwise interactions within sets have also been studied in [16, 17]. [18] approach unordered instances by finding "good" orderings.

The idea of pooling a function across set-members is not new. In [19], pooling was used in a binary classification task for causality on a set of samples. [20] use pooling across a panoramic projection of a 3D object for classification, while [21] perform pooling across multiple views. [22] observe the invariance of the payoff matrix in normal form games to the permutation of its rows and columns (i.e. player actions) and leverage pooling to predict the player action. The need for permutation equivariance also arises in deep learning over sensor networks and multi-agent settings, where a special case of Lemma 3 has been used as the architecture [23].

In light of these related works, we would like to emphasize our novel contributions: (i) the universality result of Theorem 2 for permutation invariance, which also relates DeepSets to other machine learning techniques, see Sec. 3; (ii) the permutation equivariant layer of (4), which, according to Lemma 3, identifies the necessary and sufficient form of parameter-sharing in a standard neural layer; and (iii) the novel application settings that we study next.

Figure 1: Population statistic estimation. Panels: (a) entropy estimation for rotated 2d Gaussian; (b) mutual information estimation by varying correlation; (c) mutual information estimation by varying rank-1 strength; (d) mutual information on 32d random covariance matrices. The top set of figures shows predictions of DeepSets vs SDM for the N = 2^10 case. 
The bottom set of figures depicts the mean squared error as the number of sets is increased. SDM has lower error for small N, and DeepSets requires more data to reach similar accuracy. But for high dimensional problems DeepSets easily scales to a large number of examples and produces much lower estimation error. Note that the N × N matrix inversion in SDM makes it prohibitively expensive for N > 2^14 = 16384.

4 Applications and Empirical Results

We present a diverse set of applications for DeepSets. For the supervised setting, we apply DeepSets to estimation of population statistics, sum of digits, classification of point-clouds, and regression with clustering side-information. The permutation-equivariant variation of DeepSets is applied to the task of outlier detection. Finally, we investigate the application of DeepSets to unsupervised set-expansion, in particular, concept-set retrieval and image tagging. In most cases we compare our approach with the state-of-the-art and report competitive results.

4.1 Set Input Scalar Response

4.1.1 Supervised Learning: Learning to Estimate Population Statistics

In the first experiment, we learn entropy and mutual information of Gaussian distributions, without providing any information about Gaussianity to DeepSets. The Gaussians are generated as follows:
• Rotation: We randomly chose a 2 × 2 covariance matrix Σ, and then generated N sample sets from N(0, R(α)ΣR(α)^T) of size M ∈ [300, 500] for N random values of α ∈ [0, π]. Our goal was to learn the entropy of the marginal distribution of the first dimension. R(α) is the rotation matrix.
• Correlation: We randomly chose a d × d covariance matrix Σ for d = 16, and then generated N sample sets from N(0, [Σ, αΣ; αΣ, Σ]) of size M ∈ [300, 500] for N random values of α ∈ (−1, 1). 
The goal was to learn the mutual information between the first d and the last d dimensions.
• Rank 1: We randomly chose v ∈ R^32 and then generated sample sets from N(0, I + λvv^T) of size M ∈ [300, 500] for N random values of λ ∈ (0, 1). The goal was to learn the mutual information.
• Random: We chose N random d × d covariance matrices Σ for d = 32, and, using each, generated a sample set from N(0, Σ) of size M ∈ [300, 500]. The goal was to learn the mutual information.
We train using an L2 loss with a DeepSets architecture having 3 fully connected layers with ReLU activation for both transformations φ and ρ. We compare against Support Distribution Machines (SDM) using an RBF kernel [10], and analyze the results in Fig. 1.

4.1.2 Sum of Digits

Next, we examine what happens if our set data is treated as a sequence. We consider the task of finding the sum of a given set of digits. We consider two variants of this experiment:

Text. We randomly sample a subset of at most M = 10 digits to build 100k "sets" of training examples, where the set-label is the sum of the digits in that set. We test against sums of M digits, for M starting from 5 all the way up to 100, over another 100k examples.

Figure 2: Accuracy of digit summation with text (left) and image (right) inputs. All approaches are trained on tasks of length 10 at most, tested on examples of length up to 100. We see that DeepSets generalizes better.

Image. MNIST8m [24] contains 8 million instances of 28 × 28 grey-scale stamps of digits in {0, . . . , 9}. We randomly sample a subset of at most M = 10 images from this dataset to build N = 100k "sets" of training and 100k sets of test images, where the set-label is the sum of the digits in that set (i.e. individual labels per image are unavailable). 
We test against sums of M images of MNIST digits, for M starting from 5 all the way up to 50.

We compare against recurrent neural networks, namely LSTM and GRU. All models are defined to have a similar number of layers and parameters. The output of all models is a scalar, predicting the sum of the N digits. Training is done on tasks of length 10 at most, while at test time we use examples of length up to 100. The accuracy, i.e. exact equality after rounding, is shown in Fig. 2. DeepSets generalize much better. Note that, for the image case, the best classification error for a single digit is around p = 0.01 for MNIST8m, so in a collection of N images the probability that at least one image is misclassified is 1 − (1 − p)^N, which is 40% for N = 50. This matches closely with the observed value in Fig. 2(b).

4.1.3 Point Cloud Classification

A point-cloud is a set of low-dimensional vectors. This type of data is frequently encountered in various applications like robotics, vision, and cosmology. In these applications, existing methods often convert the point-cloud data to voxel or mesh representation as a preprocessing step, e.g. [26, 29, 30]. Since the output of many range sensors, such as LiDAR, is in the form of point-cloud, direct application of deep learning methods to point-cloud is highly desirable. Moreover, it is easier and cheaper to apply transformations, such as rotation and translation, when working with point-clouds than with voxelized 3D objects.

As point-cloud data is just a set of points, we can use DeepSets to classify point-cloud representations of a subset of ShapeNet objects [31], called ModelNet40 [25]. This subset consists of 3D representations of 9,843 training and 2,468 test instances belonging to 40 classes of objects. We produce point-clouds with 100, 1000 and 5000 particles each (x, y, z-coordinates) from the mesh representation of objects using the point-cloud-library's sampling routine [32]. Each set is normalized by the initial layer of the deep network to have zero mean (along individual axes) and unit (global) variance. Tab. 1 compares our method using three permutation equivariant layers against the competition; see Appendix H for details.

Table 1: Classification accuracy and the representation-size used by different methods on the ModelNet40.

Model | Instance Size | Representation | Accuracy
3DShapeNets [25] | 30^3 | voxels (using convolutional deep belief net) | 77%
VoxNet [26] | 32^3 | voxels (voxels from point-cloud + 3D CNN) | 83.10%
MVCNN [21] | 164×164×12 | multi-view images (2D CNN + view-pooling) | 90.1%
VRN Ensemble [27] | 32^3 | voxels (3D CNN, variational autoencoder) | 95.54%
3D GAN [28] | 64^3 | voxels (3D CNN, generative adversarial training) | 83.3%
DeepSets | 5000 × 3 | point-cloud | 90 ± .3%
DeepSets | 100 × 3 | point-cloud | 82 ± 2%

4.1.4 Improved Red-shift Estimation Using Clustering Information

An important regression problem in cosmology is to estimate the red-shift of galaxies, corresponding to their age as well as their distance from us [33], based on photometric observations. One way to estimate the red-shift from photometric observations is using a regression model [34] on the galaxy clusters. The prediction for each galaxy does not change by permuting the members of the galaxy cluster. Therefore, we can treat each galaxy cluster as a "set" and use DeepSets to estimate the individual galaxy red-shifts. See Appendix G for more details.

For each galaxy, we have 17 photometric features from the redMaPPer galaxy cluster catalog [35], which contains photometric readings for 26,111 red galaxy clusters. Each galaxy-cluster in this catalog has between ∼ 20 and 300 galaxies, i.e. x ∈ R^{N(c)×17}, where N(c) is the cluster-size. 
The catalog also provides accurate spectroscopic red-shift estimates for a subset of these galaxies.

We randomly split the data into 90% training and 10% test clusters, and minimize the squared loss of the prediction for available spectroscopic red-shifts. As is customary in the cosmology literature, we report in Tab. 2 the average scatter |z_spec − z| / (1 + z_spec), where z_spec is the accurate spectroscopic measurement and z is a photometric estimate.

Table 2: Red-shift experiment. Lower scatter is better.

Method | Scatter
MLP | 0.026
redMaPPer | 0.025
DeepSets | 0.023

Table 3: Results on Text Concept Set Retrieval on LDA-1k, LDA-3k, and LDA-5k. Our DeepSets model outperforms other methods on LDA-3k and LDA-5k. However, all neural network based methods have inferior performance to the w2v-Near baseline on LDA-1k, possibly due to small data size. Higher is better for recall@K and mean reciprocal rank (MRR). Lower is better for median rank (Med.).

Method | LDA-1k (Vocab = 17k): R@10 R@100 R@1k MRR Med. | LDA-3k (Vocab = 38k): R@10 R@100 R@1k MRR Med. | LDA-5k (Vocab = 61k): R@10 R@100 R@1k MRR Med.
Random | 0.06 0.6 5.9 0.001 8520 | 0.02 0.2 2.6 0.000 28635 | 0.01 0.2 1.6 0.000 30600
Bayes Set | 1.69 11.9 37.2 0.007 2848 | 2.01 14.5 36.5 0.008 3234 | 1.75 12.5 34.5 0.007 3590
w2v Near | 6.00 28.1 54.7 0.021 641 | 4.80 21.2 43.2 0.016 2054 | 4.03 16.7 35.2 0.013 6900
NN-max | 4.78 22.5 53.1 0.023 779 | 5.30 24.9 54.8 0.025 672 | 4.72 21.4 47.0 0.022 1320
NN-sum-con | 4.58 19.8 48.5 0.021 1110 | 5.81 27.2 60.0 0.027 453 | 4.87 23.5 53.9 0.022 731
NN-max-con | 3.36 16.9 46.6 0.018 1250 | 5.61 25.7 57.5 0.026 570 | 4.72 22.0 51.8 0.022 877
DeepSets | 5.53 24.2 54.3 0.025 696 | 6.04 28.5 60.7 0.027 426 | 5.54 26.1 55.5 0.026 616

4.2 Set Expansion

In the set expansion task, we are given a set of objects that are similar to each other, and our goal is to find new objects from a large pool of candidates such that the selected new objects are similar to the query set. To achieve this, one needs to reason out the concept connecting the given set and then retrieve words based on their relevance to the inferred concept. It is an important task due to a wide range of potential applications, including personalized information retrieval, computational advertisement, and tagging large amounts of unlabeled or weakly labeled datasets.

Going back to de Finetti's theorem in Sec. 3.2, where we consider the marginal probability of a set of observations, the marginal probability allows for a very simple metric for scoring additional elements to be added to X. In other words, this allows one to perform set expansion via the following score:

s(x|X) = log p(X ∪ {x}|α) − log [ p(X|α) p({x}|α) ]   (5)

Note that s(x|X) is the point-wise mutual information between x and X. Moreover, due to exchangeability, it follows that, regardless of the order of elements, we have

S(X) = Σ_m s(xm|{xm−1, . . . , x1}) = log p(X|α) − Σ_{m=1}^{M} log p({xm}|α)   (6)

When inferring sets, our goal is to find set completions {xm+1, . . . , xM} for an initial set of query terms {x1, . . . , xm}, such that the aggregate set is coherent. This is the key idea of the Bayesian Set algorithm [36] (details in Appendix D). 
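As a concrete toy instance of the score in (5), consider independent Beta-Bernoulli features, the conjugate model underlying Bayesian Set [36]. The hyperparameters and data below are illustrative choices of ours, not from the paper:

```python
import math

def score(x, X, alpha=1.0, beta=1.0):
    """Bayesian-Set-style score s(x|X) = log p(x|X) - log p(x) for binary feature
    vectors, under an independent Beta-Bernoulli model per feature (toy priors)."""
    N = len(X)
    s = 0.0
    for j in range(len(x)):
        n1 = sum(item[j] for item in X)             # feature-j successes in the query set
        p_cond = (alpha + n1) / (alpha + beta + N)  # posterior predictive given X
        p_prior = alpha / (alpha + beta)            # prior predictive
        p = p_cond if x[j] else 1.0 - p_cond
        q = p_prior if x[j] else 1.0 - p_prior
        s += math.log(p) - math.log(q)
    return s

# A candidate sharing the query set's active features scores higher
# than one that contradicts them.
X = [(1, 1, 0), (1, 0, 0), (1, 1, 0)]
assert score((1, 1, 0), X) > score((0, 0, 1), X)
```

Ranking all candidates in the pool by this score and keeping the top ones is exactly the set-expansion procedure described above; DeepSets replaces the fixed conjugate model with learned φ and ρ.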
Using DeepSets, we can solve this problem in more generality, as we can drop the assumption of the data belonging to a certain exponential family.

For learning the score s(x|X), we take recourse to large-margin classification with structured loss functions [37] to obtain the relative loss objective l(x, x′|X) = max(0, s(x′|X) − s(x|X) + ∆(x, x′)). In other words, we want to ensure that s(x|X) ≥ s(x′|X) + ∆(x, x′) whenever x should be added and x′ should not be added to X.

Conditioning. Often machine learning problems do not exist in isolation. For example, a task like tag completion from a given set of tags is usually related to an object z, for example an image, that needs to be tagged. Such meta-data are usually abundant, e.g. author information in case of text, contextual data such as the user click history, or extra information collected with a LiDAR point cloud.

Conditioning graphical models on meta-data is often complicated. For instance, in the Beta-Binomial model we need to ensure that the counts are always nonnegative, regardless of z. Fortunately, DeepSets does not suffer from such complications, and the fusion of multiple sources of data can be done in a relatively straightforward manner. Any of the existing methods in deep learning, including feature concatenation, averaging, or max-pooling, can be employed. Incorporating these meta-data often leads to significantly improved performance, as will be shown in the experiments of Sec. 4.2.2.

4.2.1 Text Concept Set Retrieval

In text concept set retrieval, the objective is to retrieve words belonging to a 'concept' or 'cluster', given a few words from that particular concept. For example, given the set of words {tiger, lion, cheetah}, we would need to retrieve other related words like jaguar, puma, etc., which belong to the same concept of big cats. 
This task of concept set retrieval can be seen as a set completion task conditioned on the latent semantic concept, and therefore our DeepSets form a desirable approach.

Dataset. We construct a large dataset containing sets of NT = 50 related words by extracting topics from latent Dirichlet allocation [38, 39], taken out-of-the-box¹. To compare across scales, we consider three values of k = {1k, 3k, 5k}, giving us three datasets LDA-1k, LDA-3k, and LDA-5k, with corresponding vocabulary sizes of 17k, 38k, and 61k.

¹ github.com/dmlc/experimental-lda

Methods. We learn this using a margin loss with a DeepSets architecture having 3 fully connected layers with ReLU activation for both transformations φ and ρ. Details of the architecture and training are in Appendix E. We compare to several baselines: (a) Random picks a word from the vocabulary uniformly at random. (b) Bayes Set [36]. (c) w2v-Near computes the nearest neighbors in the word2vec [40] space. Note that both Bayes Set and w2v-Near are strong baselines. The former runs Bayesian inference using a Beta-Binomial conjugate pair, while the latter uses the powerful 300 dimensional word2vec trained on the billion word GoogleNews corpus². (d) NN-max uses a similar architecture as our DeepSets but uses max pooling to compute the set feature, as opposed to sum pooling. (e) NN-max-con uses max pooling on set elements but concatenates this pooled representation with that of the query for a final set feature. (f) NN-sum-con is similar to NN-max-con but uses sum pooling followed by concatenation with the query representation.

Evaluation. We consider the standard retrieval metrics (recall@K, median rank, and mean reciprocal rank) for evaluation. To elaborate, recall@K measures the number of true labels that were recovered in the top K retrieved words. We use three values of K = {10, 100, 1k}. 
The other two metrics are, as the names suggest, the median of the true label ranks and the mean of the reciprocals of the true label ranks, respectively. Each dataset is split into TRAIN (80%), VAL (10%) and TEST (10%). We learn models using TRAIN and evaluate on TEST, while VAL is used for hyperparameter selection and early stopping.

Results and Observations. As seen in Tab. 3: (a) Our DeepSets model outperforms all other approaches on LDA-3k and LDA-5k by any metric, highlighting the significance of the permutation invariance property. (b) On LDA-1k, our model does not perform well when compared to w2v-Near. We hypothesize that this is due to the small size of the dataset, insufficient to train a high capacity neural network, while w2v-Near has been trained on a billion word corpus. Nevertheless, our approach comes the closest to w2v-Near amongst the other approaches, and is only 0.5% lower by Recall@10.

4.2.2 Image Tagging

Method | ESP game: P R F1 N+ | IAPRTC-12.5: P R F1 N+
Least Sq. | 35 19 25 215 | 40 19 26 198
MBRM | 18 19 18 209 | 24 23 23 223
JEC | 24 19 21 222 | 29 19 23 211
FastTag | 46 22 30 247 | 47 26 34 280
Least Sq.(D) | 44 32 37 232 | 46 30 36 218
FastTag(D) | 44 32 37 229 | 46 33 38 254
DeepSets | 39 34 36 246 | 42 31 36 247

We next experiment with image tagging, where the task is to retrieve all relevant tags corresponding to an image. Images usually have only a subset of relevant tags; therefore, predicting other tags can help enrich information that can further be leveraged in a downstream supervised task. In our setup, we learn to predict tags by conditioning DeepSets on the image, i.e., we train to predict a partial set of tags from the image and the remaining tags. At test time, we predict tags from the image alone.

Datasets. We report results on the following three datasets: ESPGame, IAPRTC-12.5, and our in-house dataset, COCO-Tag. We refer the reader to Appendix F for more details about the datasets.

Methods. 
The setup for DeepSets to tag images is similar to that described in Sec. 4.2.1, the only difference being the conditioning on the image features, which are concatenated with the set feature obtained from pooling individual element representations.

Baselines. We perform comparisons against several baselines previously reported in [41]. Specifically, we have Least Sq., a ridge regression model, MBRM [42], JEC [43], and FastTag [41]. Note that these methods do not use deep features for images, which could lead to an unfair comparison. As there is no publicly available code for MBRM and JEC, we cannot get the performance of these models with Resnet-extracted features. However, we report results with deep features for FastTag and Least Sq., using code made available by the authors (http://www.cse.wustl.edu/~mchen/).

Evaluation. For ESPgame and IAPRTC-12.5, we follow the evaluation metrics of [44]: precision (P), recall (R), F1 score (F1), and number of tags with non-zero recall (N+). These metrics are evaluated for each tag and the mean is reported (see [44] for further details). For COCO-Tag, however, we use recall@K for three values of K = {10, 100, 1000}, along with median rank and mean reciprocal rank (see the evaluation in Sec. 4.2.1 for metric details).

Table 4: Results of image tagging on the ESPgame and IAPRTC-12.5 datasets. The performance of our DeepSets approach is roughly similar to that of the best competing approaches, except for precision. Refer to the text for more details. Higher is better for all metrics: precision (P), recall (R), F1 score (F1), and number of non-zero recall tags (N+).

Figure 3: Each row shows a set, constructed from the CelebA dataset, such that all set members except for an outlier share at least two attributes (shown on the right). The outlier is identified with a red frame.
The model is trained by observing examples of sets and their anomalous members, without access to the attributes. The probability assigned to each member by the outlier detection network is visualized using a red bar at the bottom of each image. The probabilities in each row sum to one.

Results and Observations. Tab. 4 shows the results of image tagging on ESPgame and IAPRTC-12.5, and Tab. 5 those on COCO-Tag. Here are the key observations from Tab. 4: (a) the performance of our DeepSets model is comparable to the best approaches on all metrics but precision; (b) our recall beats the best approach by 2% on ESPgame. On further investigation, we found that the DeepSets model retrieves more relevant tags, which are not present in the list of ground truth tags due to the limited 5-tag annotation. Thus, this takes a toll on precision while gaining on recall, yet yields an improvement on F1. On the larger and richer COCO-Tag, we see that the DeepSets approach outperforms the other methods comprehensively, as expected. Qualitative examples are in Appendix F.

Table 5: Results on the COCO-Tag dataset. Clearly, DeepSets outperforms the other baselines significantly. Higher is better for recall@K and mean reciprocal rank (MRR); lower is better for median rank (Med).

Method            Recall@10  Recall@100  Recall@1k  MRR    Med.
w2v NN (blind)    5.6        20.0        54.2       0.021  823
DeepSets (blind)  9.0        39.2        71.3       0.044  310
DeepSets          31.4       73.4        95.3       0.131  28

4.3 Set Anomaly Detection

The objective here is to find the anomalous face in each set, simply by observing examples and without any access to the attribute values. The CelebA dataset [45] contains 202,599 face images, each annotated with 40 boolean attributes. Using these attributes, we build N = 18,000 training sets of 64 × 64 stamps, each containing M = 16 images, as follows: randomly select 2 attributes, draw 15 images having those attributes, and a single target image where both attributes are absent.
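This sampling procedure can be sketched as follows; `images_with`, the attribute lookup, is a hypothetical helper (not from the paper's code) that returns the ids of images having, or lacking, both given attributes:

```python
import random

def build_anomaly_set(images_with, attributes, m=16, rng=random):
    """Build one set: pick 2 random attributes, draw m-1 images having
    both, plus a single outlier image where both attributes are absent.
    `images_with(a, b, present)` is a hypothetical lookup helper."""
    a, b = rng.sample(attributes, 2)
    inliers = rng.sample(images_with(a, b, present=True), m - 1)
    outlier = rng.choice(images_with(a, b, present=False))
    members = inliers + [outlier]
    rng.shuffle(members)
    # supervision: index of the anomalous member within the set
    return members, members.index(outlier)
```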
Using a similar procedure, we build sets on the test images. No individual person's face appears in both the train and test sets. Our deep neural network consists of 9 2D-convolution and max-pooling layers followed by 3 permutation-equivariant layers, and finally a softmax layer that assigns a probability value to each set member (note that one could identify an arbitrary number of outliers using a sigmoid activation at the output). Our trained model successfully finds the anomalous face in 75% of test sets. Visually inspecting these instances suggests that the task is non-trivial even for humans; see Fig. 3.

As a baseline, we repeat the same experiment using a set-pooling layer after the convolution layers and replacing the permutation-equivariant layers with fully connected layers of the same size, where the final layer is a 16-way softmax. The resulting network shares the convolution filters for all instances within all sets; however, the input to the softmax is not equivariant to the permutation of input images. Permutation equivariance seems to be crucial here, as the baseline model achieves a training and test accuracy of ∼6.3%, the same as random selection. See Appendix I for more details.

5 Summary

In this paper, we develop DeepSets, a model based on powerful permutation invariance and equivariance properties, along with the theory to support its performance. We demonstrate the generalization ability of DeepSets across several domains through extensive experiments, and show both qualitative and quantitative results. In particular, we explicitly show that DeepSets outperforms other intuitive deep networks that are not backed by theory (Sec. 4.2.1, Sec. 4.1.2). Last but not least, it is worth noting that each state-of-the-art method we compare to is a specialized technique for its task, whereas our one model, i.e., DeepSets, is competitive across the board.

References

[1] B. Poczos, A. Rinaldo, A. Singh, and L.
Wasserman. Distribution-free distribution regression. In International Conference on AI and Statistics (AISTATS), JMLR Workshop and Conference Proceedings, 2013.

[2] I. Jung, M. Berges, J. Garrett, and B. Poczos. Exploration and evaluation of AR, MPCA and KL anomaly detection techniques to embankment dam piezometer data. Advanced Engineering Informatics, 2015.

[3] M. Ntampaka, H. Trac, D. Sutherland, S. Fromenteau, B. Poczos, and J. Schneider. Dynamical mass measurements of contaminated galaxy clusters using machine learning. The Astrophysical Journal, 2016. URL http://arxiv.org/abs/1509.05409.

[4] M. Ravanbakhsh, J. Oliva, S. Fromenteau, L. Price, S. Ho, J. Schneider, and B. Poczos. Estimating cosmological parameters from the dark matter distribution. In International Conference on Machine Learning (ICML), 2016.

[5] J. Oliva, B. Poczos, and J. Schneider. Distribution to distribution regression. In International Conference on Machine Learning (ICML), 2013.

[6] Z. Szabo, B. Sriperumbudur, B. Poczos, and A. Gretton. Learning theory for distribution regression. Journal of Machine Learning Research, 2016.

[7] K. Muandet, D. Balduzzi, and B. Schoelkopf. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), 2013.

[8] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schoelkopf. Learning from distributions via support measure machines. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS 2012), 2012.

[9] Felix A. Faber, Alexander Lindmaa, O. Anatole von Lilienfeld, and Rickard Armiento. Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys. Rev. Lett., 117:135502, Sep 2016. doi: 10.1103/PhysRevLett.117.135502.

[10] B. Poczos, L. Xiong, D. Sutherland, and J. Schneider.
Support distribution machines, 2012. URL http://arxiv.org/abs/1202.0302.

[11] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 2012.

[12] Robert Gens and Pedro M Domingos. Deep symmetry networks. In Advances in Neural Information Processing Systems, pages 2537–2545, 2014.

[13] Taco S Cohen and Max Welling. Group equivariant convolutional networks. arXiv preprint arXiv:1602.07576, 2016.

[14] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Equivariance through parameter-sharing. arXiv preprint arXiv:1702.08389, 2017.

[15] Xu Chen, Xiuyuan Cheng, and Stéphane Mallat. Unsupervised deep Haar scattering on graphs. In Advances in Neural Information Processing Systems, pages 1709–1717, 2014.

[16] Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.

[17] Nicholas Guttenberg, Nathaniel Virgo, Olaf Witkowski, Hidetoshi Aoki, and Ryota Kanai. Permutation-equivariant neural networks applied to dynamics prediction. arXiv preprint arXiv:1612.04530, 2016.

[18] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.

[19] David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, and Léon Bottou. Discovering causal signals in images. arXiv preprint arXiv:1605.08179, 2016.

[20] Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.

[21] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller.
Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

[22] Jason S Hartford, James R Wright, and Kevin Leyton-Brown. Deep learning for predicting human strategic behavior. In Advances in Neural Information Processing Systems, pages 2424–2432, 2016.

[23] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Neural Information Processing Systems, pages 2244–2252, 2016.

[24] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. In Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007.

[25] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

[26] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.

[27] Andrew Brock, Theodore Lim, JM Ritchie, and Nick Weston. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016.

[28] Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. arXiv preprint arXiv:1610.07584, 2016.
[29] Siamak Ravanbakhsh, Junier Oliva, Sebastien Fromenteau, Layne C Price, Shirley Ho, Jeff Schneider, and Barnabás Póczos. Estimating cosmological parameters from the dark matter distribution. In Proceedings of The 33rd International Conference on Machine Learning, 2016.

[30] Hong-Wei Lin, Chiew-Lan Tai, and Guo-Jin Wang. A mesh reconstruction algorithm driven by an intrinsic property of a point cloud. Computer-Aided Design, 36(1):1–9, 2004.

[31] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[32] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9–13, 2011.

[33] James Binney and Michael Merrifield. Galactic Astronomy. Princeton University Press, 1998.

[34] AJ Connolly, I Csabai, AS Szalay, DC Koo, RG Kron, and JA Munn. Slicing through multicolor space: Galaxy redshifts from broadband photometry. arXiv preprint astro-ph/9508100, 1995.

[35] Eduardo Rozo and Eli S Rykoff. redMaPPer II: X-ray and SZ performance benchmarks for the SDSS catalog. The Astrophysical Journal, 783(2):80, 2014.

[36] Zoubin Ghahramani and Katherine A Heller. Bayesian sets. In NIPS, volume 2, pages 22–23, 2005.

[37] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32, Cambridge, MA, 2004. MIT Press.

[38] Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data.
Genetics, 155(2):945–959, 2000. ISSN 0016-6731. URL http://www.genetics.org/content/155/2/945.

[39] David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[40] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[41] Minmin Chen, Alice Zheng, and Kilian Weinberger. Fast image tagging. In Proceedings of The 30th International Conference on Machine Learning, pages 1274–1282, 2013.

[42] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR'04, pages 1002–1009, Washington, DC, USA, 2004. IEEE Computer Society.

[43] Ameesh Makadia, Vladimir Pavlovic, and Sanjiv Kumar. A new baseline for image annotation. In Proceedings of the 10th European Conference on Computer Vision: Part III, ECCV '08, pages 316–329, Berlin, Heidelberg, 2008. Springer-Verlag.

[44] Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, and Cordelia Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Computer Vision, 2009 IEEE 12th International Conference on, pages 309–316. IEEE, 2009.

[45] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

[46] Branko Ćurgus and Vania Mascioni. Roots and polynomials as homeomorphic spaces. Expositiones Mathematicae, 24(1):81–95, 2006.
[47] Boris A Khesin and Serge L Tabachnikov. Arnold: Swimming Against the Tide, volume 86. American Mathematical Society, 2014.

[48] Jerrold E Marsden and Michael J Hoffman. Elementary Classical Analysis. Macmillan, 1993.

[49] Nicolas Bourbaki. Éléments de mathématiques: Théorie des ensembles, chapitres 1 à 4, volume 1. Masson, 1990.

[50] C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11–22, 1986.

[51] Luis Von Ahn and Laura Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 319–326. ACM, 2004.

[52] Michael Grubinger. Analysis and evaluation of visual information systems performance. Ph.D. thesis, Victoria University, Melbourne, Vic., 2007. URL http://eprints.vu.edu.au/1435.

[53] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[54] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[55] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.