{"title": "Trading robust representations for sample complexity through self-supervised visual experience", "book": "Advances in Neural Information Processing Systems", "page_first": 9617, "page_last": 9627, "abstract": "Learning in small sample regimes is among the most remarkable features of the human perceptual system. This ability is related to robustness to transformations, which is acquired through visual experience in the form of weak- or self-supervision during development. We explore the idea of allowing artificial systems to learn representations of visual stimuli through weak supervision prior to downstream supervised tasks. We introduce a novel loss function for representation learning using unlabeled image sets and video sequences, and experimentally demonstrate that these representations support one-shot learning and reduce the sample complexity of multiple recognition tasks. We establish the existence of a trade-off between the sizes of weakly supervised, automatically obtained from video sequences, and fully supervised data sets. Our results suggest that equivalence sets other than class labels, which are abundant in unlabeled visual experience, can be used for self-supervised learning of semantically relevant image embeddings.", "full_text": "Trading robust representations for sample complexity\n\nthrough self-supervised visual experience\n\nAndrea Tacchetti\u2217\n\nStephen Voinea\n\nGeorgios Evangelopoulos\u2020\n\nThe Center for Brains, Minds and Machines, MIT\n\nMcGovern Institute for Brain Research at MIT\n\nCambridge, MA, USA\n\n{atacchet, voinea, gevang}@mit.edu\n\nAbstract\n\nLearning in small sample regimes is among the most remarkable features of the\nhuman perceptual system. This ability is related to robustness to transformations,\nwhich is acquired through visual experience in the form of weak- or self-supervision\nduring development. 
We explore the idea of allowing artificial systems to learn representations of visual stimuli through weak supervision prior to downstream supervised tasks. We introduce a novel loss function for representation learning using unlabeled image sets and video sequences, and experimentally demonstrate that these representations support one-shot learning and reduce the sample complexity of multiple recognition tasks. We establish the existence of a trade-off between the sizes of weakly supervised data sets, automatically obtained from video sequences, and fully supervised data sets. Our results suggest that equivalence sets other than class labels, which are abundant in unlabeled visual experience, can be used for self-supervised learning of semantically relevant image embeddings.

1 Introduction

Transformation invariance and learning in small sample regimes are among the most remarkable abilities of the human perceptual system, and arguably the ones that have proven most difficult to replicate in artificial systems. For example, while humans effortlessly recognize new faces after seeing a single image, despite changes in pose, illumination, or facial expression, convolutional neural networks require thousands of examples to achieve similar degrees of generalization. Crucially, these two abilities are mathematically and computationally related, in the sense that robust representations of perceptual input support low-sample generalization [1, 6, 27, 8, 30].

Neuroscientists have long debated whether exposure to specific modes of visual experience, such as spatial proximity or sequential presentation, is necessary to learn visual representations that are robust to complex transformations, or if most of these abilities are innate and independent of visual experience [23, 24, 29]. 
Experiments on chicks reared in highly controlled visual environments revealed that being exposed to temporally smooth object transitions during development is necessary to acquire robustness to 3D rotations [36]. Similarly, newborn monkeys deprived of any visual experience of faces were found not to exhibit a face selective cortical area [4]. These results highlight that exposure to a naturalistic visual experience is key to learning invariant representations and, more generally, to the development of a powerful visual system.

Motivated by these findings, we aim to bridge the sample complexity divide between artificial and biological perception systems. We argue that while humans can learn new visual concepts from few examples, they do so by relying on the rich and exhaustive visual experience acquired during and after development. By contrast, artificial systems typically rely on explicit supervision of individual instances for learning representations and prediction functions from their inputs. In this work, we consider the idea of allowing artificial systems to learn representations of images through weak supervision, prior to performing supervised visual tasks in low sample regimes.

*Currently with DeepMind. †Currently with X, Alphabet.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Image orbits for self-supervised learning from videos: Images of an orbit sequence from the "Late Night" video face dataset (top row), random samples from distinct orbits (middle row) and their detected canonical, frontal view (bottom row).
We consider as a natural source of weak supervision the existence of classes or groupings in the input space that are not necessarily related to any downstream learning task, but which encode equivalence relations. Examples are the set of images of an object under rotations [2, 14] or the frames of a video of a moving object [20, 40, 33]. More generally, these equivalence relations can be temporal, categorical or generative, and partition the space into sets which we will loosely refer to as orbits in this paper.

In order to explicitly use the information in such partitions of a training set, we introduce a new loss function that promotes representations that are robust to intra-orbit and selective to inter-orbit relations. Using the same deep convolutional parametrization, we compare and contrast our approach to reference methods: the supervised triplet loss [25], the ranking loss [33], the surrogate class loss [10] and, when a full model of the orbit-generating process is available (e.g. affine transformations), spatial transformer networks [16].

We demonstrate how image embeddings learned through weak supervision using our novel loss function and orbit sets support one-shot learning in a variety of settings and reduce the sample complexity of challenging recognition tasks. Furthermore, we establish the existence of a trade-off between the sizes of weakly supervised and fully supervised data sets. Overall, our results suggest that by partitioning observations into equivalence sets, similar to what is achieved through unlabeled visual experience or self-supervision, and using the proposed loss, one can learn semantically relevant embeddings and lower the sample complexity of downstream visual tasks.

Related work: Theoretical aspects of representations invariant to transformations have been studied through the properties of parameterization choices such as convolutions [1, 6, 27, 8, 3]. 
Robustness and equivariance have also been sought through explicit representation [38, 15] or estimation of transformation parameters.

Similarity information has been used as weak supervision in variants of the triplet loss function for deep metric [28] and embedding learning [33]. Supervised versions of the triplet loss have been used for discriminatively-trained metric learning, through convolutional neural networks (CNNs), by minimizing the true objective of the task, e.g. face verification [7, 25]. The triplet loss was also used for learning transformation-invariant embeddings [14].

Methods using self-supervision for learning or pre-training representations exploit temporal continuity [20, 40, 12, 33, 17, 34], spatial proximity (context, inpainting) [9] or data manipulation (transformation, colorization) [18]. The latter include predicting rotations applied to images [11] and matching pixels between sheared and original images [22]. Defining surrogate classes populated by data augmentation transformations was explored in the exemplar CNN framework [10].

2 Definitions and related loss functions

Let Φ : X → F be a representation on input space X, selected from some hypothesis space H ⊆ {Φ | Φ : X → F}. We will assume X and F to be Euclidean spaces, and that elements of H can be parameterized for learning (e.g. Φ is a deep CNN). Embedding maps Φ can be learned by minimizing a loss function L that is suitable for the task at hand. Loss functions can be defined on the input space L : X × X → R+, the feature space L : F × F → R+, or the output space L : Y × Y → R+; moreover, loss functions can be combined to provide more structure to the learning problem.

Transformations and orbit sets: A family of transformations is a set of maps G ⊂ {g | g : X → X}, where gx is the action of the transformation. 
Transformations in G act on x to generate points x′, i.e. x′ = gx = g(x). Transformations can be parametrized by θj ∈ Θ, so that G = {gj = g_θj | θj ∈ Θ} (e.g. rigid translation, or rotation about the center).

Definition 1 (Group orbits [1]). An orbit associated to an element x ∈ X is the set of points that can be reached under the transformations G, i.e., Ox = {gx ∈ X | g ∈ G} ⊂ X.

Triplet loss: The triplet loss [25] enforces that Φ maps points into the embedding space so that pairwise intra-class dissimilarities are smaller than pairwise inter-class dissimilarities. Let (xi, xj) be image pairs; then, by defining triplets T = {(xi, xp, xq) | yi = yp, yi ≠ yq}, where yi ∈ Y is the class label, the loss enforces

L(Φ(xi), Φ(xp)) ≤ L(Φ(xi), Φ(xq)) − α    (1)

by minimizing a variant of the large margin loss [35]:

min_{Φ∈H} Σ_{i=1}^{|T|} | L(Φ(xi), Φ(xp)) + α − L(Φ(xi), Φ(xq)) |₊    (2)

where α ∈ R+ is a distance margin for the non-matching pairs, |·|₊ = max{0, ·} is the hinge loss and L is a measure of dissimilarity in the output space (e.g. Euclidean distance).

Reconstruction loss: In addition to the encoding map Φ : X → F in H, we can consider a corresponding decoding map Φ̃ : F → X in H̃ and minimize the reconstruction or autoencoder error

min_{Φ∈H, Φ̃∈H̃} Σ_{i=1}^{n} L(xi, Φ̃ ∘ Φ(xi))    (3)

over n training points. The two maps usually have related structures (e.g. tied weights, convolutional-deconvolutional) [5]. 
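As an illustration (a plain-Python sketch, not the authors' code), the per-triplet summand of Eq. (2) and the per-sample reconstruction error of Eq. (3) can be written as follows, with squared Euclidean distance standing in for L:

```python
def triplet_hinge(phi_i, phi_p, phi_q, alpha=0.2):
    """Summand of Eq. (2): |L(phi_i, phi_p) + alpha - L(phi_i, phi_q)|_+ ,
    with L the squared Euclidean distance in the output space."""
    d_ip = sum((a - b) ** 2 for a, b in zip(phi_i, phi_p))
    d_iq = sum((a - b) ** 2 for a, b in zip(phi_i, phi_q))
    return max(0.0, d_ip + alpha - d_iq)

def reconstruction_error(x_i, x_hat):
    """Summand of Eq. (3): dissimilarity between the input x_i and its
    reconstruction x_hat by the decoder-encoder composition."""
    return sum((a - b) ** 2 for a, b in zip(x_i, x_hat))
```

A matching pair that is already closer than the non-matching pair by more than the margin contributes zero loss.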
The loss function L is again a measure of dissimilarity, here in the input space (e.g. Euclidean distance or mean squared error).

Exemplar loss: A surrogate class can be formed by applying random transformations, sampled from some known G, to each point in an unlabeled set {xi}, i = 1, ..., n. The exemplar loss [10] minimizes a discriminative loss L (e.g. the cross-entropy loss) with respect to these surrogate classes:

min_{Φ∈H, f∈Hf} Σ_{i=1}^{n} Σ_{j=1}^{ki≤|G|} L(i, f(Φ(gj xi)))    (4)

where i indexes the original, untransformed training set and serves as the surrogate class label for points generated from xi; f is a classifier learned jointly with the embedding Φ. By learning to classify transformed images according to their source untransformed example, the exemplar loss promotes the learning of embeddings that are robust to transformations in G.

Spatial transformer networks (STNs): When a plausible forward model of the action of g ∈ G is known and has a suitable parametrization, STNs [16] learn to undo transformations in G.

3 Representation learning using a novel loss for transformation sets

Let Xn = {xi}, i = 1, ..., n, be a set of unlabeled instances and assume X = R^d, for example xi being vectorized images. We aim to learn an embedding Φ : X → F, for some F = R^k with k ≤ d, such that the corresponding metric

D(x, x′) = ||Φ(x) − Φ(x′)||²_F ,    D : X × X → R+    (5)

where ||·||_F denotes the norm in F, is robust to transformation sets and selective for sets that are not equivalent up to transformations. The above condition is equivalently written as

x′ ∼ x ⇔ x, x′ ∈ Ox ⇔ D(x, x′) ≤ ε    (6)

where we use an ε-approximation to the zero-norm distance for exact invariance, and Ox is a generic orbit set (see below). Note that the requirement for selectivity, i.e., the converse direction, makes (R^d, D) a proper metric space.

3.1 Orbit sets for weak- or self-supervision

Following Definition 1, orbits are sets stemming from equivalence relations in X. For example, given a group structure on the transformations G, the input space X is partitioned into orbits as x ∼ x′ ⇔ ∃g ∈ G : x′ = gx, ∀x, x′ ∈ X. We relax this definition to include set memberships provided by categorical labels, as well as other transformations that do not enjoy a group structure.

Definition 2 (Generic orbits). An orbit associated to x ∈ X is the subset of X that includes x along with an equivalence relation: Ox = {x′ ∈ X | x ∼ x′} ⊂ X, given by a function c : X → C such that x ∼ x′ ⇔ c(x) = c(x′).

Examples of c are the labels of a supervised learning task, the indexes of vector quantization codewords or, for the case of sequential data such as videos, the (sub)sequence membership, with C the set of classes, codewords or sequences respectively (as in Fig. 1, row 1). Generic orbits can be obtained from data in an explicit way, e.g. by transforming points, or an implicit way, e.g. through auxiliary tasks and groupings using weak supervision. Here are some examples:

Virtual examples: Given a parametrized set of transformations, orbits can be generated by randomly sampling from the parameter vectors {θj ∈ Θ} and letting Ox = {g_θj x | θj ∈ Θ} for a given x. Examples include geometric transformations, e.g. 
rotation, translation, scaling [21] or typical data augmentation transforms, such as cropping, contrast, color, blur or illumination [10].

Acquisition: If the data acquisition process can be designed ad hoc, or if its characteristics are encoded in meta-data, e.g. multiple samples of an object across time, conditions or views, then an orbit can be associated to all samples from a session [13].

Self-supervision: For sequential data such as videos, an orbit can be a continuous segment of the video stream, or an object/object part detected and tracked across time [33], following plausible expectations on feature smoothness and continuity of the representation [40, 17].

In the following, for a given Xn, we obtain the set of orbits {Oxi} either via an auxiliary supervision signal if available (video sequence, or image session index), so that Xn = ∪_{xi} Oxi, or by augmentation of each xi ∈ Xn, so that Oxi = {g_θj xi | θj ∈ Θ}. We further assign a canonical example xc ∈ Oxi to each orbit.

Definition 3 (Orbit canonical example). The canonical example is an arbitrarily chosen point in the orbit set that provides a reference coordinate system for the transformations and is consistent across orbits. For Ox obtained through virtual examples, xc is the result of the identity transformation g0 ∈ G, i.e. xc = g0 x = x. Otherwise, xc is chosen or detected as the no-pose or neutral condition example (Fig. 1, bottom row).

3.2 Loss function

Given the training set orbits, we define a set of triplets of points

T ⊂ {(xi, xp, xq) | xi ∈ Xn, xp ∈ Oxi, xq ∈ Oxq ; Oxi ∩ Oxq = ∅}    (7)

such that each xi is assigned a positive example xp (in-orbit), i.e. xi ∼ xp ⇔ xi, xp ∈ Oxi, and a negative example xq (out-of-orbit), i.e. Oxq ∩ Oxi = ∅. 
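A minimal sketch of constructing such triplets from orbit-partitioned data (hypothetical helper; the paper's actual training uses the soft negative selection process of [25], see Sec. 4):

```python
import random

def sample_orbit_triplet(orbits, rng=random):
    """Sample one (anchor, positive, negative) triplet satisfying Eq. (7):
    the positive shares the anchor's orbit, the negative comes from a
    disjoint orbit. `orbits` is a list of lists of samples."""
    i, q = rng.sample(range(len(orbits)), 2)      # two distinct orbits
    anchor, positive = rng.sample(orbits[i], 2)   # in-orbit pair
    negative = rng.choice(orbits[q])              # out-of-orbit example
    return anchor, positive, negative
```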
The proposed loss function is composed of two terms: (1) a discriminative term, based on the triplet loss, using distances between the encodings Φ : R^d → R^k in the feature space R^k,

Lt(xi, xp, xq) = | ||Φ(xi) − Φ(xp)||²_{R^k} + α − ||Φ(xi) − Φ(xq)||²_{R^k} |₊    (8)

with α a distance margin, and (2) a reconstruction error between the decoder output Φ̃ : R^k → R^d and the orbit canonical, as a distance in the input space R^d,

Le(xi, xc) = || xc − Φ̃ ∘ Φ(xi) ||²_{R^d} ,    xc ∈ Oxi.    (9)

The representation learning problem is then formulated as

min_{Φ, Φ̃} Σ_{i=1}^{|T|} ( (λ1/k) Lt(xi, xp, xq) + (λ2/d) Le(xi, xc) )    (10)

where the hyperparameters λ1, λ2 control the relative contribution of the two terms.

Dataset | Embedding | Train/Test/Validation
MNIST | Train x 32 | Test x 32: 10-fold
Even/Odd MNIST | Even Train x 32 | Odd Test x 32: 10-fold
NORB | Train (25 objects) | Test (15/10 objects, 1000 re-splits)
Multi-PIE | 2 Sessions (4 choose 2) | 2 Sessions (4 choose 2)
Late Night | 500 sequences | 28 x 3 sequences

Dataset | Transformations | Orbits | Canonical
MNIST | 2D affine | Random affine (32) | Original
Even/Odd MNIST | 2D affine | Random affine (32) | Original
NORB | 3D pose | 3D pose (162) | Frontal
Multi-PIE | 3D viewpoint | 3D viewpoint (13) | Frontal
Late Night | Unconstrained | 500 detected sequences | Min yaw face

Table 1: Datasets, orbits and canonical definitions used in our experimental evaluations.

3.2.1 Orbit triplet (OT) loss

For λ2 = 0 the loss reduces to the triplet loss in Eq. (2), with the orbit identity as label. 
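A one-sample sketch of the combined objective of Eqs. (8)-(10), with hypothetical encoder/decoder callables `phi` and `phi_dec` standing in for Φ and Φ̃ (illustrative only, not the paper's implementation):

```python
def orbit_joint_loss(phi, phi_dec, x_i, x_p, x_q, x_c,
                     alpha=0.2, lam1=1.0, lam2=1.0):
    """Per-triplet summand of Eq. (10): (lam1/k)*Lt + (lam2/d)*Le,
    where Lt is the triplet hinge of Eq. (8) and Le the reconstruction
    error to the orbit canonical x_c of Eq. (9)."""
    z_i, z_p, z_q = phi(x_i), phi(x_p), phi(x_q)
    d_ip = sum((a - b) ** 2 for a, b in zip(z_i, z_p))
    d_iq = sum((a - b) ** 2 for a, b in zip(z_i, z_q))
    lt = max(0.0, d_ip + alpha - d_iq)                    # Eq. (8)
    x_hat = phi_dec(z_i)
    le = sum((a - b) ** 2 for a, b in zip(x_c, x_hat))    # Eq. (9)
    k, d = len(z_i), len(x_c)
    return (lam1 / k) * lt + (lam2 / d) * le              # Eq. (10)
```

Setting lam2 = 0 recovers the orbit triplet (OT) special case and lam1 = 0 the orbit encoder (OE) case discussed next.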
Points that lie\non the same orbit are pulled together and points on different orbits are pushed apart. The minimizer is\ndriven to Eq. (1), using all triplets in the training set, which is satis\ufb01ed in the theoretical minimum of\nLt. The orbit triplet loss resembles a Siamese network architecture [7], with a tied weights embedding\ntrained using triplets as input.\n\n3.2.2 Orbit encoder (OE) recti\ufb01cation loss\nFor \u03bb1 = 0 the loss reduces to the reconstruction error of the canonical xc of the orbit Oxi by the\ndecoded \u03a6(xi). Equivalently, this is the error of the recti\ufb01cation that \u02dc\u03a6 \u25e6 \u03a6 : X \u2192 X applies on\nthe input xi, which is some transformation of xc. This is a novel autoencoder loss, a generalization\nof denoising autoencoders (which promote reconstruction of clean versions from noisy inputs) to\ntransformations with or without an explicit generative model. In this case the encoder-decoder pair\n\u02dc\u03a6 \u25e6 \u03a6 learns to de-transform all elements of an orbit by aligning arbitrary samples (Fig. 1, middle\nrow) to the orbit canonical element (Fig. 1, bottom row).\nConceptually, the orbit encoder loss enforces selectivity on \u03a6 by preserving suf\ufb01cient information to\nreconstruct the input irrespective of the transformation. For computing the loss across pairs (xi, xc)\nfrom different orbits, the choice of xc has to be consistent only across orbits of the same semantic\nclass of a downstream task, e.g. faces or numbers, and is otherwise arbitrary.\n\n4 Experimental evaluation of learned image representations\n\nWe systematically evaluate representations learned from different datasets and applied to different\nvisual recognition tasks using the same parametrization, different loss functions, and varying degrees\nof supervision (unsupervised, supervised, weakly-supervised). 
Our experimental procedure is explic-\nitly designed to probe the impact of learning robust representations using generic orbit sets on the\nperformance in downstream tasks. To this end, we separate the representation learning step from the\nsupervised learning task and purposefully choose extremely low samples (e.g. 1 example per class)\nand simple classi\ufb01ers (e.g. 1-NN) for the latter. Moreover, we select datasets where de\ufb01ning orbit\nsets is intuitive and done by either generating virtual examples, or by exploiting the characteristics of\nthe data acquisition process. Table 1 provides an overview of the datasets, transformations, orbit and\ncanonical examples de\ufb01nitions in our evaluations. Overall, the sets include 3D viewpoint, light and\nunconstrained transformations; MNIST is the only one with analytic, known transformation orbits.\nEvaluation procedure: For each dataset, we employ an Embedding set to learn representations\nand a separate Validation set for determining the optimal number of SGD iterations (early stopping\nhyperparameter) by maximizing the mean, across multiple re-splits of the set, of task-speci\ufb01c\nperformance metrics. The representations are used for encoding Train and Test sets from a Supervised\nset for the downstream recognition task. Performance is reported as the mean/std of multiple train/test\nre-splits of the Supervised set. For all experiments, there is no overlap between embedding, validation,\nand supervised sets, or between train/test splits, within the validation and the supervised sets. These\nrules were enforced at the level of orbit subsets, a stronger requirement than excluding single samples.\n\n5\n\n\fFigure 2: Two-dimensional t-SNE visualizations of face image embeddings from 10 randomly sampled\nsubjects/classes (coded by different colors) of Multi-PIE test set.\n\nEmbeddings: We compare embeddings learned using the proposed loss, termed Orbit Joint (OJ)\nfrom Eq. 
(10); the two special cases, Orbit Triplet (OT) (\u03bb2 = 0) and Orbit Encode (OE) (\u03bb1 = 0);\nand three, closely related losses: standard Autoencoder (AE), Supervised Triplet (ST) [25] and\nExemplar (EX) [10]. OT also corresponds, analytically and empirically, to using the ranking loss\n(RL) [33] with the orbit-based triplets de\ufb01ned in our work. For the af\ufb01ne MNIST evaluations, we\nalso used an embedding featuring a Spatial Transformer Networks module [16], trained with orbit\nsupervision (OT-STN) or full supervision (ST-STN). It is worth highlighting that we included the ST\nand AE embeddings only as a way of providing a sense for the scale of the quantities we consider.\nST is a fully supervised method (i.e. the partitioning of the embedding dataset is de\ufb01ned according\nto the same class labels as the downstream task) and it provides a performance ceiling to weakly\nsupervised methods (similarly ST-STN). AE on the other hand is purely unsupervised (i.e. there is no\npartitioning at all, each image stands on its own), and serves as a performance \ufb02oor.\nNetwork and training: For the encoder, we used a deep convolutional network with architecture\nfollowing VGG networks [26]. Each layer is composed of a series of convolutions with a small 3 \u00d7 3\nkernel (of stride 1, padding 1), batch normalization and Recti\ufb01er Linear Unit (ReLU) activations. A\nspatial max pooling layer (of stride 2 and size either 2 \u00d7 2 or 4 \u00d7 4) was used every two such layers\nof convolutions. The number of channels doubled after each max pooling layer, ranging from 16\nto 128 for MNIST, 64 to 512 for Multi-PIE and the Late Night Dataset and 16 to 256 for NORB\n(image sizes were single channel, 40 \u00d7 40 px for MNIST, 3-channel, 128 \u00d7 128 px for Multi-PIE,\nsingle channel 96 \u00d7 96 px for NORB and 3-channel 96 \u00d7 96 px for the Late Night dataset). 
Four iterations of convolution and pooling were followed by a final fully-connected layer of size 1024. The decoder is a deconvolutional network, reversing the series of encoder operations, using convolutional reconstruction and max unpooling [37], in direct correspondence to the encoder in the number of layers, filters per layer and size of kernels. Encoder and decoder weights were tied, with free biases.

Loss minimization was carried out with mini-batch Stochastic Gradient Descent using the Adam optimizer. We used mini-batches of 256 images for MNIST, 72 for Multi-PIE, 256 for NORB, and 128 for the Late Night dataset. The selection of triplets for ST, OT and OJ followed the soft negative selection process from [25]. The values for λ1 and λ2 were set equal (λ1 = λ2 = 1). The STN modules consisted of two max pooling-convolution-ReLU blocks with 20 filters of size 5 × 5 (stride 1), pooling regions of size 2 × 2 and no overlap, followed by two linear layers. Where applied, these were inserted between the input and the rest of the network.

Summary of results: A summary of the test set performance in all tasks and sets is provided in Table 2. Notably, the performance of OJ is either better than, or statistically indistinguishable from, OT and OE (at a standard significance threshold of p < 0.05). This observation makes the case for the joint loss, which can result in substantial improvements, as in the one-shot classification task on MNIST with affine transformations. This also suggests that a more principled selection of the hyperparameters λ1 and λ2, e.g. by cross-validation, could lead to further performance gains.

4.1 Affine transformations: MNIST

We first validated our method on MNIST using affine transformations to generate orbit sets. This allows us to illustrate our method on a simple dataset with easily defined orbits. 
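The random affine orbit construction used for MNIST in this section can be sketched as follows (hypothetical helper; applying the sampled parameters to an image, e.g. via torchvision's functional `affine`, is omitted):

```python
import random

# Uniform parameter ranges for the random affine orbits (Sec. 4.1).
AFFINE_RANGES = {
    "rotation": (-90.0, 90.0),   # degrees
    "shear": (-0.3, 0.3),        # shearing factor
    "scale": (0.7, 1.3),
    "tx": (-15.0, 15.0),         # translation, pixels
    "ty": (-15.0, 15.0),
}

def sample_affine_orbit_params(n=32, rng=random):
    """Sample n affine parameter vectors; together with the identity
    transform (the canonical example) they define one 33-sized orbit."""
    return [{k: rng.uniform(lo, hi) for k, (lo, hi) in AFFINE_RANGES.items()}
            for _ in range(n)]
```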
We transformed each image in the original set with 32 random affine transformations. Transformations were sampled uniformly from rotations in [−90°, 90°], shearing factor in [−0.3, 0.3], scale in [0.7, 1.3], and translation in [−15, 15] pixels in each dimension. The embedding set consisted of the original training set (50 × 10³ images), augmented by 32 transformations for each sample, resulting in a total of 1650 × 10³ images. Images were grouped in 50 × 10³ orbits, each one a 33-sized set of the

 | ST [25] | OJ | EX [10] | OT, RL [33] | OE | ST-STN | OT-STN
MNIST | 0.97±0.02 | 0.67±0.05 | 0.44±0.06 | 0.37±0.04 | 0.40±0.05 | 0.97±0.01 | 0.34±0.03
Even/Odd | 0.49±0.08 | 0.59±0.08 | 0.55±0.07 | 0.52±0.07 | 0.54±0.07 | 0.51±0.02 | 0.54±0.02
NORB | 0.67±0.13 | 0.59±0.12 | 0.58±0.11 | 0.55±0.12 | 0.54±0.11 | - | -
M-PIE AUC | 0.99±0.00 | 0.95±0.01 | 0.89±0.01 | 0.92±0.01 | 0.87±0.02 | - | -
Precision | 0.99±0.00 | 0.91±0.01 | 0.71±0.04 | 0.93±0.02 | 0.83±0.02 | - | -

Table 2: Performance evaluation using embedding-validation-test splits. Entries in bold denote significant performance difference between the proposed loss (OJ) and the corresponding weakly supervised losses EX, OT, OE (p-values less than 0.05, Bonferroni-corrected paired t-test). 
STN columns denote the use of additional spatial transformer modules [16].

Figure 3: Classification accuracy on one-shot learning (1-NN) using learned image embeddings, for MNIST, Even/Odd MNIST and NORB. Box-plots obtained with multiple train/test splits (Table 2).

original (canonical example) and its random transformations. We also used 10 random re-splits of the augmented test set for the validation and train/test-sets.

The learned embeddings were employed in a one-shot classification task, using one image per class and Nearest Neighbor classification (1-NN). Classification accuracy over the 10 re-splits is shown in Fig. 3 (left). The proposed loss OJ achieved top accuracy among the weakly-supervised methods, followed by EX and OE. Note how the addition of spatial transformer modules provided no improvement either with full or with orbit supervision.

4.2 Transfer learning: even/odd MNIST

We designed a simple transfer learning setting by using exclusively images of even digits in the embedding set and odd digits in all other sets. The task was a new 5-way, one-shot classification, and the effect on accuracy can be seen in Fig. 3 (middle). In this case, the OJ loss significantly outperforms the supervised ST. 
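The one-shot 1-NN protocol used in these evaluations can be sketched as (plain-Python illustration, squared Euclidean distance assumed):

```python
def one_shot_1nn_accuracy(train, test):
    """`train`: one (embedding, label) pair per class; `test`: list of
    (embedding, label) pairs. Each test point receives the label of its
    nearest training embedding; returns classification accuracy."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    correct = sum(
        min(train, key=lambda t: sqdist(z, t[0]))[1] == y
        for z, y in test
    )
    return correct / len(test)
```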
Sampling (randomly) from the transformations and using them to define finer equivalence classes was more helpful than having access to class-level information.

4.3 3D-affine transformations: NORB

To assess robustness to changes in illumination and 3D pose, we used the NORB-small dataset [19]. The dataset contains images of 50 toys from 5 generic categories, acquired from 162 different 3D viewpoints and under 6 lighting conditions. The embedding set consisted of 5 objects per category from the original training set, and the validation and train/test sets of 2 and 3 objects per category from the original test set, respectively. The task was one-shot classification with 1-NN, with 1000 random re-splits into validation and train/test-sets (Fig. 3 (right)). This is the only dataset in our evaluations in which OJ performance falls in the same range as the orbit triplet and the exemplar loss.

4.4 Face transformations: Multi-PIE

The Multi-PIE dataset [13] contains images of faces of 129 individuals, captured from 13 distinct viewpoints and under 20 different illumination conditions. Acquisition was carried out across four sessions, resulting in a dataset of 129 × 13 × 20 × 4 = 134,160 images. We used six splits (4 choose 2) by session (2 sessions for embedding, 1 for validation and 1 for test). 
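The six session splits can be enumerated with `itertools` (the assignment of the two held-out sessions to validation vs. test is an assumption of this sketch):

```python
from itertools import combinations

SESSIONS = (1, 2, 3, 4)

def multipie_session_splits():
    """Enumerate the six (4 choose 2) splits: two sessions for embedding
    learning, one each of the remaining sessions for validation and test."""
    splits = []
    for emb in combinations(SESSIONS, 2):
        val, test = [s for s in SESSIONS if s not in emb]
        splits.append({"embedding": emb, "validation": val, "test": test})
    return splits
```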
For learning the embeddings, the ST method had access to the face identity for each image, thus considering equivalence classes formed by all images of the same subject (across sessions, viewpoints and illumination conditions). The weakly-supervised methods (OJ, OT, OE, EX) had access only to the set of 129 × 20 × 3 orbits in the embedding set, each corresponding to all 13 viewpoints for a single identity, illumination condition and session.

Figure 2 shows the relative distance landscape, as 2D t-SNE plots [31], for all images from 10 random subjects of the test set, encoded using embeddings learned with different losses. We used two distance-based tasks to evaluate performance: a same-different face verification task and a face retrieval task, measuring the Area Under the ROC Curve (AUC) and the mean top-1 precision, respectively. The proposed OJ loss achieves the best AUC score among the weakly-supervised methods (Table 2), and similarly for the top-1 precision scores, with the notable exception of OT outperforming OJ.

For the verification task, we used all unique pairwise distances in the embedding space. For the retrieval task, we selected the closest point to a query (top-1) from a target set. We considered each test image as a query, and the rest of the test set as the target set, after removing all same-identity images (32 in total, including the query) at the same illumination (regardless of viewpoint) and at the same viewpoint (regardless of illumination), to ensure we were evaluating the preference of identity over appearance.

5 Self-supervised embeddings from videos

Video sequences provide an interpretable and rich form of self-supervision through the temporal progression of a scene, an event, or a moving/transforming object. In this section we focus on exploiting this source of weak supervision to lower the sample complexity of a supervised downstream task. 
That is, by learning robust embeddings from video sequences, we are able to lower the number of labelled examples required to achieve a fixed performance level in a downstream recognition task.

5.1 Late Night face dataset

We collected a dataset of video clips of human faces for learning embeddings and testing recognition and transfer learning. The sources were YouTube video clips from the official channel of "CONAN", a late night TV talk show, featuring face transformations such as 3D pose changes and deformations, e.g. facial expressions and dialogue.

Automatic orbit extraction: The collection pipeline was designed to extract contiguous clips of individually cropped faces (Fig. 1, top row) and was as follows: (1) Detect scene changes (i.e. camera cuts) and assign each frame to a unique, uninterrupted sequence. (2) Detect and crop faces, independently in each frame, using a high-recall face detector [32]. (3) Enhance detection precision by running a high-precision face detector on the cropped faces [39]. (4) Construct orbits using the detection positions in adjacent frames: regions in subsequent frames were assigned to the same orbit if their bounding boxes overlapped; otherwise, a new orbit was started from the latter detection. (5) Retain the largest orbit (by number of images) from each video. (6) Choose a canonical example for each orbit as the face image with the least yaw displacement from a frontal pose (estimated by the face detector); in case of multiple candidates, choose the one with the highest detection confidence score [39].

Embedding set: From processing 500 unlabeled source videos, we collected 271,415 images across 500 orbits, which were not manually validated or quality assured in any way. Crucially, individuals appear in more than one orbit.

Evaluation set: We collected 84 orbit clips of 28 faces (3 orbits per identity), not appearing in the embedding set.
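Step (4) of the pipeline above, linking detections across adjacent frames by bounding-box overlap, can be sketched as follows. This is a minimal sketch under stated assumptions: the `(x1, y1, x2, y2)` box format and the `build_orbits` helper are our own illustrative choices, and the face detections themselves are assumed given:

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for (x1, y1, x2, y2) boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def build_orbits(frames):
    """frames: list of per-frame detection lists [(x1, y1, x2, y2), ...].
    A detection joins an orbit if its box overlaps a box that belonged to
    that orbit in the previous frame; otherwise a new orbit is started."""
    orbits = []   # each orbit: list of (frame_index, box)
    prev = []     # (orbit_index, box) pairs from the previous frame
    for t, detections in enumerate(frames):
        cur = []
        for box in detections:
            match = next((oi for oi, pb in prev if boxes_overlap(box, pb)), None)
            if match is None:          # no overlap with the previous frame:
                match = len(orbits)    # start a new orbit
                orbits.append([])
            orbits[match].append((t, box))
            cur.append((match, box))
        prev = cur
    return orbits

# Two tracks: one face drifting slowly (one orbit), one appearing later.
frames = [[(0, 0, 10, 10)],
          [(2, 0, 12, 10)],
          [(4, 0, 14, 10), (50, 50, 60, 60)],
          [(52, 50, 62, 60)]]
orbits = build_orbits(frames)
print(len(orbits))                  # → 2
print(max(len(o) for o in orbits))  # → 3 (the drifting face)
```

Step (5) then reduces to `max(orbits, key=len)` per video.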
For this evaluation set, we manually verified that all images in each orbit belong to the same identity, and labelled the dataset according to face identity. This resulted in 47,919 images, in 28 categories and 84 orbits.

5.2 Sample complexity and embedding complexity

Learned embeddings were used to encode the evaluation set for a 28-class discrimination task using a linear SVM. To measure the gains in sample complexity between the different methods, we randomly sampled 20 training sets (containing 1 to 20 images per class, each resampled 10 times) and one test set with 64 images per class (images from the same orbit never appeared in both sets). Results are shown in Fig. 4 (left), where one can note that OJ outperforms all baselines and special cases in the small-sample regions of the plot, e.g. for classification with 7 samples per class or lower.

Figure 4: (left) Sample complexity: Classification accuracy (mean with standard error across 10 re-samples) vs. training set size on the Late Night face dataset. The task was a 28-way face discrimination (linear SVM), with embeddings learned on a separate set (500 orbits). ST is not available due to the lack of class labeling on the automatically constructed embedding set. (right) Embedding and sample complexity trade-off: Accuracy map (mean across 10 train/test re-splits of the validation set) of OJ for 1-20 labeled examples per class for classifier learning and 10-500 orbits for embedding learning.
The comparison in Fig. 4 (left) also highlights the effect that the representation can have on learning tasks with a limited label budget.

Next, we considered the trade-off in data resources between learning the embedding and learning a predictor on a separate set. In Fig. 4 (right) we demonstrate this as a two-dimensional map of classification accuracy for OJ on the Late Night dataset, letting the number of labeled examples vary between 1 and 20 per class (as in the sample complexity plot) and the number of orbits between 10 and 500. Note how the map is dominated by large equi-accuracy regimes, which define contiguous performance regions. Moving along those regions, or fixing the performance requirements, is equivalent to trading supervised resources (labelled examples) for unlabelled embedding resources that are extracted automatically (via scene-change detection, face detection, and tracking). The demonstration of this trade-off is, to the best of our knowledge, a new finding.

6 Conclusions

Motivated by findings in the neuroscience of vision on the necessary role of visual experience in the development of robust visual perception, we considered image representation learning using self- or weak-supervision from unlabeled image sets and videos. We proposed a novel loss function that combines a discriminative and a rectification component with complementary objectives. The scheme outperforms state-of-the-art exemplar- and ranking-based losses and, where applicable, spatial transformer modules, in distance-based and classification tasks. In addition, we find that the learned representations reduce the "label budget" of supervised learning by trading it for "free" self- or weak-supervision.
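Reading this trade-off off an accuracy map amounts to following equi-accuracy contours: for a fixed target accuracy, more unlabeled orbits buy the same performance with fewer labels. A minimal sketch (the toy accuracy grid and the `labels_needed` helper are our own illustrative assumptions, not measured values):

```python
import numpy as np

def labels_needed(acc_grid, labels_axis, target):
    """For each column (orbit count) of an accuracy grid indexed as
    acc_grid[labels, orbits], return the smallest number of labeled
    examples per class reaching `target` accuracy (None if unreachable)."""
    out = []
    for col in acc_grid.T:
        hits = np.nonzero(col >= target)[0]
        out.append(labels_axis[hits[0]] if hits.size else None)
    return out

# Toy grid: accuracy grows with both labels (rows) and orbits (columns),
# so more orbits should buy the same accuracy with fewer labels.
labels_axis = [1, 2, 5, 10]
acc = np.array([[0.4, 0.6, 0.8],
                [0.5, 0.7, 0.9],
                [0.7, 0.85, 0.95],
                [0.8, 0.9, 0.97]])
print(labels_needed(acc, labels_axis, target=0.8))  # → [10, 5, 1]
```

The monotone drop in required labels across columns is exactly the label-for-orbit exchange described above.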
From a practical standpoint, our work suggests that partitioning the training set into equivalence classes, defined by sampled or implicitly acquired transformations, is a useful weak-supervision signal for extracting embeddings that are semantically relevant and economical to learn. Future work will involve orbit evaluation and sampling on natural and complex image and video datasets.

Acknowledgments

We would like to thank Tomaso Poggio for his advice and supervision throughout the project and the McGovern Institute for Brain Research at MIT for supporting this research. The DGX-1 used for our experiments was donated by NVIDIA. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.

References

[1] F. Anselmi, L. Rosasco, and T. Poggio. On invariance and selectivity in representation learning. Information and Inference, 2015.

[2] F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio. Unsupervised learning of invariant representations. Theoretical Computer Science, 2016.

[3] F. Anselmi, G. Evangelopoulos, L. Rosasco, and T. Poggio. Symmetry-adapted representation learning. Pattern Recognition, 2019.

[4] M. J. Arcaro, P. F. Schade, J. L. Vincent, C. R. Ponce, and M. S. Livingstone. Seeing faces is necessary for face-domain formation. Nature Neuroscience, 2017.

[5] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE PAMI, 2013.

[6] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE PAMI, 2013.

[7] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE CVPR, 2005.

[8] T. S. Cohen and M. Welling. Group equivariant convolutional networks. In ICML, 2016.

[9] C. Doersch, A. Gupta, and A. A. Efros.
Unsupervised visual representation learning by context prediction. In IEEE ICCV, 2015.

[10] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE PAMI, 2016.

[11] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

[12] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In IEEE ICCV, 2015.

[13] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 2010.

[14] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE CVPR, 2006.

[15] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.

[16] M. Jaderberg, K. Simonyan, and A. Zisserman. Spatial transformer networks. In NeurIPS, 2015.

[17] D. Jayaraman and K. Grauman. Slow and steady feature analysis: Higher order temporal coherence in video. In IEEE ICCV, 2016.

[18] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. In IEEE CVPR, 2017.

[19] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE CVPR, 2004.

[20] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, 2009.

[21] P. Niyogi, F. Girosi, and T. Poggio. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE, 1998.

[22] D. Novotný, S. Albanie, D. Larlus, and A. Vedaldi. Self-supervised learning of geometrically stable features through probabilistic introspection. In IEEE CVPR, 2018.

[23] G. Perry, E. T. Rolls, and S. M. Stringer.
Spatial vs temporal continuity in view invariant visual object recognition learning. Vision Research, 2006.

[24] G. Perry, E. T. Rolls, and S. M. Stringer. Continuous transformation learning of translation invariant representations. Experimental Brain Research, 2010.

[25] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE CVPR, 2015.

[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.

[27] S. Soatto and A. Chiuso. Visual representations: Defining properties and deep approximations. In ICLR, 2016.

[28] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In IEEE CVPR, 2016.

[29] A. Tacchetti, L. Isik, and T. Poggio. Invariant recognition drives neural representations of action sequences. PLOS Computational Biology, 2017.

[30] A. Tacchetti, L. Isik, and T. A. Poggio. Invariant recognition shapes neural representations of visual input. Annual Review of Vision Science, 2018.

[31] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.

[32] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In IEEE CVPR, 2001.

[33] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In IEEE ICCV, 2015.

[34] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. In IEEE ICCV, 2017.

[35] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 2009.

[36] J. N. Wood and S. M. Wood. The development of invariant object recognition requires visual experience with temporally smooth objects. Cognitive Science, 2018.

[37] M. D. Zeiler, D.
Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In IEEE CVPR, 2010.

[38] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where auto-encoders. In ICLR, 2016.

[39] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In IEEE CVPR, 2012.

[40] W. Zou, S. Zhu, K. Yu, and A. Ng. Deep learning of invariant features via simulated fixations in video. In NeurIPS, 2012.