{"title": "Learning Deep Disentangled Embeddings With the F-Statistic Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 185, "page_last": 194, "abstract": "Deep-embedding methods aim to discover representations of a domain that make explicit the domain's class structure and thereby support few-shot learning. Disentangling methods aim to make explicit compositional or factorial structure. We combine these two active but independent lines of research and propose a new paradigm suitable for both goals. We propose and evaluate a novel loss function based on the $F$ statistic, which describes the separation of two or more distributions. By ensuring that distinct classes are well separated on a subset of embedding dimensions, we obtain embeddings that are useful for few-shot learning. By not requiring separation on all dimensions, we encourage the discovery of disentangled representations. Our embedding method matches or beats state-of-the-art, as evaluated by performance on recall@$k$ and few-shot learning tasks. Our method also obtains performance superior to a variety of alternatives on disentangling, as evaluated by two key properties of a disentangled representation: modularity and explicitness. The goal of our work is to obtain more interpretable, manipulable, and generalizable deep representations of concepts and categories.", "full_text": "Learning Deep Disentangled Embeddings\n\nWith the F-Statistic Loss\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nMichael C. Mozer\n\nUniversity of Colorado\n\nBoulder, Colorado\n\nmozer@colorado.edu\n\nKarl Ridgeway\n\nUniversity of Colorado\n\nand Sensory, Inc.\nBoulder, Colorado\n\nkarl.ridgeway@colorado.edu\n\nAbstract\n\nDeep-embedding methods aim to discover representations of a domain that make\nexplicit the domain\u2019s class structure and thereby support few-shot learning. Dis-\nentangling methods aim to make explicit compositional or factorial structure. 
We\ncombine these two active but independent lines of research and propose a new\nparadigm suitable for both goals. We propose and evaluate a novel loss function\nbased on the F statistic, which describes the separation of two or more distribu-\ntions. By ensuring that distinct classes are well separated on a subset of embedding\ndimensions, we obtain embeddings that are useful for few-shot learning. By not\nrequiring separation on all dimensions, we encourage the discovery of disentan-\ngled representations. Our embedding method matches or beats state-of-the-art, as\nevaluated by performance on recall@k and few-shot learning tasks. Our method\nalso obtains performance superior to a variety of alternatives on disentangling, as\nevaluated by two key properties of a disentangled representation: modularity and\nexplicitness. The goal of our work is to obtain more interpretable, manipulable,\nand generalizable deep representations of concepts and categories.\n\nThe literature on deep embeddings (Chopra et al., 2005; Yi et al., 2014a; Schroff et al., 2015; Ustinova\n& Lempitsky, 2016; Song et al., 2016; Vinyals et al., 2016; Snell et al., 2017) addresses the problem\nof discovering representations of a domain that make explicit a particular property of the domain\ninstances. We refer to this property as class or category or identity. For example, a set of animal\nimages might be embedded such that animals of the same species lie closer to one another in the\nembedding space than to animals of a different species. Deep-embedding methods are trained using\na class-aware oracle which can be queried to indicate whether two instances are of the same or\ndifferent class. Because this paradigm can handle an arbitrary number of classes, and because the\ncomplete set of classes does not have to be speci\ufb01ed in advance\u2014as they would be in an ordinary\nclassi\ufb01er\u2014deep embeddings are useful for few-shot learning. 
A small set of examples of novel\nclasses can be projected into the embedding space, and an unknown instance can be classi\ufb01ed by its\nproximity to the embeddings of the labeled examples.\nSimilar to deep embeddings, the literature on disentangling attempts to discover representations of a\nset of instances, but rather than making explicit a single property of the instances (class), the goal\nis to make explicit multiple, independent properties, which we refer to as factors. For example, a\ndisentangled representation of animals might include factors indicating its size, length of its ears, and\nwhether it has feet or \ufb01ns. We will later be more rigorous in de\ufb01ning a disentangled representation,\nbut for now we operate with the informal notion that the factors form a compositional or distributed\nrepresentation such that with relatively few factors and relatively few values of each factor, the\nfactor values can be recombined to span the set of instances. Disentangling has been explored using\neither a fully unsupervised procedure (Chen et al., 2016; Higgins et al., 2017) or a semi-supervised\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fprocedure in which a factor-aware oracle can be queried to specify a factor along with sets of\ninstances partitioned by factor value (Reed et al., 2014; Kingma et al., 2014; Kulkarni et al., 2015;\nKaraletsos et al., 2015; Reed et al., 2015).\nDespite their overlapping and related goals, surprisingly little effort has been made to connect research\nin deep embeddings and disentangling. There are two obvious ways to make the connection. First, a\nfactor-aware oracle might be used to train deep embeddings (instead of a class-aware oracle), and\nhopefully disentangled representations would emerge. 
Second, a class-aware oracle might be used to\ntrain disentangled representations (instead of a factor-aware oracle), and hopefully an embedding\nsuitable for few-shot learning would emerge. We primarily pursue the former approach, but brie\ufb02y\nexplore the latter as well.\nIn the next section, we propose a deep-embedding method that is suitable for both few-shot learning\nof novel classes and for disentangling factors. After describing the algorithm and showing that it\nobtains state-of-the-art results on the recall@k task that is ordinarily used to evaluate embeddings,\nwe turn to analyzing how well the algorithm disentangles the factors that contribute to class identity.\nTo perform a rigorous evaluation, we put forth formal, quanti\ufb01able criteria for disentanglement,\nand we show that our algorithm outperforms other state-of-the-art deep-embedding methods and\ndisentanglement methods in achieving these criteria.\n\n1 Using the F statistic to separate classes\n\nDeep-embedding methods attempt to discover a nonlinear projection such that instances of the same\nclass lie close together in the embedding space and instances of different classes lie far apart. The\nalgorithms mostly have heuristic criteria for determining how close is close and how far is far, and\nthey terminate learning once a solution meets the criterion. The criterion can be speci\ufb01ed by a\nuser-adjustable margin parameter (Schroff et al., 2015; Chopra et al., 2005) or by ensuring that\nevery within-class pair is closer together than any of the between-class pairs (Ustinova & Lempitsky,\n2016). We propose a method that determines when to terminate using the currency of probability and\nstatistical hypothesis testing. 
It also aligns dimensions of the embedding space with the underlying generative factors—categorical and semantic features—and thereby facilitates the disentangling of representations.

For expository purposes, consider two classes, $C = \{1, 2\}$, having $n_1$ and $n_2$ instances, which are mapped to a one-dimensional embedding. The embedding coordinate of instance $j$ of class $i$ is denoted $z_{ij}$. The goal of any embedding procedure is to separate the coordinates of the two classes. In our approach, we quantify the separation via the probability that the true class means in the underlying environment, $\mu_1$ and $\mu_2$, are different from one another. Our training goal can thus be formulated as minimizing $\Pr(\mu_1 = \mu_2 \mid s(z), n_1, n_2)$, where $s(z)$ denotes summary statistics of the labeled embedding points. This posterior is intractable, so instead we operate on the likelihood $\Pr(s(z) \mid \mu_1 = \mu_2, n_1, n_2)$ as a proxy.

We borrow a particular statistic from analysis of variance (ANOVA) hypothesis testing for equality of means. The statistic is a ratio of between-class variability to within-class variability:

$$ s = \tilde{n} \, \frac{\sum_i n_i (\bar{z}_i - \bar{\bar{z}})^2}{\sum_{i,j} (z_{ij} - \bar{z}_i)^2} $$

where $\bar{z}_i = \langle z_{ij} \rangle$ and $\bar{\bar{z}} = \langle \bar{z}_i \rangle$ are expectations and $\tilde{n} = n_1 + n_2 - 2$. Under the null hypothesis $\mu_1 = \mu_2$ and an additional normality assumption, $z_{ij} \sim \mathcal{N}(\mu, \sigma^2)$, our statistic $s$ is a draw from a Fisher-Snedecor (or $F$) distribution with degrees of freedom 1 and $\tilde{n}$, $S \sim F_{1,\tilde{n}}$. Large values of $s$ indicate that embeddings from the two different classes are well separated relative to two embeddings from the same class, which is unlikely under $F_{1,\tilde{n}}$. 
Thus, the CDF of the $F$ distribution offers a measure of the separation between classes:

$$ \Pr(S < s \mid \mu_1 = \mu_2, \tilde{n}) = I\left(\frac{s}{s + \tilde{n}};\ \frac{1}{2},\ \frac{\tilde{n}}{2}\right) \qquad (1) $$

where $I$ is the regularized incomplete beta function, which is differentiable and thus can be incorporated into an objective function for gradient-based training.

Several comments on this approach are in order. First, although it assumes the two classes have equal variance, the likelihood in Equation 1 is fairly robust against inequality of the variances as long as $n_1 \approx n_2$.¹ Second, the $F$ statistic can be computed for an arbitrary number of classes; the generalization of the likelihood in Equation 1 is conditioned on all class instances being drawn from the same distribution. Because this likelihood is a very weak indicator of class separation, we restrict our use of the $F$ statistic to class pairs. Third, this approach is based entirely on statistics of the training set, whereas every other deep-embedding method of which we are aware uses training criteria that are based on individual instances. For example, the triplet loss (Schroff et al., 2015) attempts to ensure that for specific triplets $\{z_{11}, z_{12}, z_{21}\}$, $z_{11}$ is closer to $z_{12}$ than to $z_{21}$. Objectives based on specific instances will be more susceptible to noise in the data set and may be more prone to overfitting.

1.1 From one to many dimensions

Our example in the previous section assumed one-dimensional embeddings. We have explored two extensions of the approach to many-dimensional embeddings. 
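Before turning to the multi-dimensional case, the unidimensional separation measure of Equation 1 can be computed directly from two sets of one-dimensional embedding coordinates. The following is a minimal numpy/scipy sketch; the function name is our illustrative choice (not taken from the paper's released code), and `scipy.special.betainc` supplies the regularized incomplete beta function $I$:

```python
import numpy as np
from scipy.special import betainc

def separation_probability(z1, z2):
    """Pr(S < s | mu1 = mu2, n~): Equation 1 for two 1-D classes.

    z1, z2: embedding coordinates of the two classes. Returns a value
    near 1 when the classes are well separated.
    """
    z1, z2 = np.asarray(z1, float), np.asarray(z2, float)
    n1, n2 = len(z1), len(z2)
    n_tilde = n1 + n2 - 2
    grand = (z1.mean() + z2.mean()) / 2.0        # mean of the class means
    between = n1 * (z1.mean() - grand) ** 2 + n2 * (z2.mean() - grand) ** 2
    within = ((z1 - z1.mean()) ** 2).sum() + ((z2 - z2.mean()) ** 2).sum()
    s = n_tilde * between / within               # the F statistic
    # CDF of F(1, n~) via the regularized incomplete beta function I
    return betainc(0.5, n_tilde / 2.0, s / (s + n_tilde))
```

For well-separated classes this quantity approaches 1; for overlapping classes it stays small, which is why $-\ln$ of it serves as a sensible loss.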
First, if we assume that the Euclidean distances between embedded points are gamma distributed—which turns out to be a good empirical approximation at any stage of training—then we can represent the numerator and denominator in the $F$ statistic as sums of gamma random variables, and a variant of the unidimensional separation measure (Equation 1) can be used to assess separation based on Euclidean distances. Second, we can apply the unidimensional separation measure for multiple dimensions of the many-dimensional embedding space. We adopt the latter approach because—as we explain shortly—it facilitates disentangling.

For a given class pair $(\alpha, \beta)$, we compute

$$ \Phi(\alpha, \beta, k) \equiv \Pr(S < s \mid \mu_{\alpha k} = \mu_{\beta k}, n_\alpha + n_\beta - 2) $$

for each dimension $k$ of the embedding space. We select a set, $D_{\alpha,\beta}$, of the $d$ dimensions with largest $\Phi(\alpha, \beta, k)$, i.e., the dimensions that are best separated already. Although it is important to separate classes, they needn't be separated on all dimensions because the pair may have semantic similarity or equivalence along some dimensions. The pair is separated if they can be distinguished reliably on a subset of dimensions.

For a training set or a mini-batch with multiple instances of a set of classes $C$, our embedding objective is to maximize the joint probability of separation for all class pairs $(\alpha, \beta)$ on all relevant dimensions, $D_{\alpha,\beta}$. Framed as a loss, we minimize the negative log probability:

$$ \mathcal{L}_F = -\sum_{\{\alpha,\beta\} \in C} \ \sum_{k \in D_{\alpha,\beta}} \ln \Phi(\alpha, \beta, k) $$

Figure 1 shows an illustration of the algorithm's behavior. We sample instances $x_{\alpha 1..N}, x_{\beta 1..M}$ from classes $\alpha$ and $\beta$, and choose $N$ and $M$ such that $N \approx M$. The neural net encodes these instances as embeddings $z_{\alpha 1..N}$ and $z_{\beta 1..M}$, with dimensions $k = 1..D$. 
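Putting the pieces together, the per-pair loss $\mathcal{L}_F$ restricted to the $d$ best-separated dimensions can be sketched as below. This is an illustrative numpy version under our own naming; actual training would express the same computation with autodiff ops so that gradients reach the encoder:

```python
import numpy as np
from scipy.special import betainc

def f_statistic_loss(za, zb, d=2):
    """Per-pair F-statistic loss for classes alpha and beta.

    za: (n_a, D) embeddings of class alpha; zb: (n_b, D) of class beta.
    Computes the per-dimension separation Phi(alpha, beta, k), keeps the
    d largest, and returns -sum(ln Phi) over those dimensions; all other
    dimensions contribute zero loss.
    """
    za, zb = np.asarray(za, float), np.asarray(zb, float)
    n_tilde = len(za) + len(zb) - 2
    ma, mb = za.mean(axis=0), zb.mean(axis=0)    # per-dimension class means
    grand = (ma + mb) / 2.0                      # mean of class means
    between = len(za) * (ma - grand) ** 2 + len(zb) * (mb - grand) ** 2
    within = ((za - ma) ** 2).sum(axis=0) + ((zb - mb) ** 2).sum(axis=0)
    s = n_tilde * between / within               # F statistic, one per dim k
    phi = betainc(0.5, n_tilde / 2.0, s / (s + n_tilde))   # Equation 1
    top_d = np.sort(phi)[-d:]                    # the set D_{alpha, beta}
    return -np.log(top_d).sum()
```

A minibatch loss would sum this quantity over all class pairs $\{\alpha, \beta\}$ in the batch.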
The variable \u03c6(\u03b1, \u03b2) indicates the\ndegree of separation for each dimension, where high values (darker) indicate better separation. In this\ncase, dimension 2 has the best separation, with low within-class and high between-class variance.\nThe algorithm maximizes the d largest values of \u03c6(\u03b1, \u03b2), and sets the loss for all other dimensions\nequal to zero.\nThis F -statistic loss has four desirable properties. First, the gradient rapidly drops to zero once classes\nbecome reliably separated on at least d dimensions, leading to a natural stopping criterion; the degree\nof separation obtained is related to the number of samples per class. Second, in contrast to other\nlosses, the F-statistic loss is not invariant to rotations in the embedding space; this focus on separating\nalong speci\ufb01c dimensions tends to yield disentangled features when the class structure is factorial or\ncompositional. Third, embeddings obtained are relatively insensitive to the one free parameter, d.\nFourth, because the loss is expressed in the currency of probability it can readily be combined with\nadditional losses expressed similarly (e.g., a reconstruction loss framed as a likelihood). The following\nsections demonstrate the advantages of the F -statistic loss for classi\ufb01cation and for disentangling\nattributes related to class identity.\n\n1For two classes, the F -statistic is equivalent to the square of a t-statistic. To address the potential issue of\nunequal variances, we explored replacing the F statistic with the Welch correction for a t statistic, but we found\nno improvement in model performance, and we prefer formulating the loss in terms of an F statistic due to its\ngreater generality.\n\n3\n\n\fFigure 1: Illustration of the behavior of the F -statistic loss\nfor a pair of classes in a minibatch. We sample instances\nx\u03b11..N , x\u03b21..M from classes \u03b1 and \u03b2. 
The neural net\nencodes these instances as embeddings z\u03b11..N and z\u03b21..M ,\nwith dimensions k = 1..D. The activations are indicated\nby the intensity of the blue color. The variable \u03c6(\u03b1, \u03b2)\nindicates the degree of separation for each dimension,\nwhere high values (darker circle) indicate better separation.\nIn this case, dimension 2 has the best separation, with\nlow within-class and high between-class variance. The\nalgorithm maximizes the d largest values of \u03c6(\u03b1, \u03b2), and\nsets the loss for all other dimensions equal to zero.\n\n2\n\nIdentity classi\ufb01cation\n\nIn this section, we demonstrate the performance of the F -statistic loss compared to state-of-the-art\ndeep-embedding losses on identity classi\ufb01cation. The \ufb01rst task involves matching a person from\na wide-angle, full-body photograph, taken at various angles and poses. For this task, we evaluate\nusing two datasets\u2014CUHK03 (Li et al., 2014) and Market-1501 (Zheng et al., 2015)\u2014following\nthe methodology of Ustinova & Lempitsky (2016). The second task involves matching a bird from a\nwide angle photograph; we evaluate performance on the CUB-200-2011 birds dataset (Wah et al.,\n2011). Five-fold cross validation is performed in every case. The \ufb01rst split is used to tune model\nhyper-parameters, and we report accuracy on the \ufb01nal four splits. This same procedure was used to\nevaluate the F -statistic loss and four competitors.\n\n2.1 Training details\n\nFor CUHK03 and Market-1501, we use the Deep Metric Learning (Yi et al., 2014b) architecture,\nfollowing Ustinova & Lempitsky (2016). For CUB-200-2011, we use an inception v3 (Szegedy\net al., 2016) network pretrained on ImageNet, and extract the 2048-dimensional features from the\n\ufb01nal pooling layer. We treat these features as constants, and optimize a fully connected net, with\n1024 hidden ReLU units. For every dataset, we use a 500-dimensional embedding. 
All nets were\ntrained using the ADAM (Kingma & Ba, 2014) optimizer, with a learning rate of 10\u22124 for all losses,\nexcept the F-statistic loss, which we found bene\ufb01tted from a slightly higher learning rate (2 \u00d7 10\u22124).\nFor each split, a validation set was withheld from the training set, and used for early stopping. To\nconstruct a mini-batch for training, we randomly select 12 identities, with up to 10 samples of\neach identity, as in Ustinova & Lempitsky (2016). In addition to the F -statistic loss, we evaluated\nhistogram (Ustinova & Lempitsky, 2016), triplet (Schroff et al., 2015), binomial deviance (Yi et al.,\n2014a), and lifted structured similarity softmax (LSSS) (Song et al., 2016) losses. For the triplet loss,\nwe use all triplets in the minibatch. For the histogram loss and binomial deviance losses, we use all\npairs. For the F -statistic loss, we use all class pairs. The triplet loss is trained and evaluated using L2\ndistances. The F -statistic loss is evaluated using L2 distances. As in Ustinova & Lempitsky (2016),\nembeddings obtained discovered by the histogram and binomial-deviance losses are constrained to\nlie on the unit hypersphere; cosine distance is used for training and evaluation. For the F -statistic\nloss, we determined the best value of d, the number of dimensions to separate, using the validation\nset of the \ufb01rst split. Performance is relatively insensitive to d for 2 < d < 100. For CUHK03 we\nchose d = 70, for Market-1501 d = 63, and for CUB-200 d = 3. For the triplet loss we found that\na margin of 0.1 worked well for all datasets. For binomial deviance and LSSS losses, we used the\nbest settings for each dataset as determined in Ustinova & Lempitsky (2016). Code for all models is\navailable at https://github.com/kridgeway/f-statistic-loss-nips-2018\n\n2.2 Results\n\nEmbedding procedures are typically evaluated with either recall@k or with a few-shot learning\nparadigm. 
The two evaluations are similar: using held-out classes, $q$ instances of each class are projected to the embedding space (the references) and performance is judged by the proximity of a query instance to references in the embedding space. We evaluate with recall@1 or 1-nearest neighbor, which judges the query instance as correctly classified if the closest reference is of the same class. This is equivalent to a $q$-shot learning evaluation; for our data sets, $q$ ranged from 3 to 10. (For readers familiar with recall@$k$ curves, we note that relative performance of algorithms generally does not vary with $k$, and $k = 1$ shows the largest differences.)

Loss                 CUHK03            Market-1501       CUB-200-2011
F-Statistic          90.17% ± 0.44%    84.21% ± 0.44%    55.22% ± 0.75%
Histogram            86.07% ± 0.73%    84.46% ± 0.23%    58.89% ± 0.89%
Triplet              81.18% ± 0.61%    80.59% ± 0.64%    45.09% ± 0.80%
Binomial Deviance    85.37% ± 0.45%    84.12% ± 0.27%    59.05% ± 0.73%
LSSS                 85.75% ± 0.62%    83.46% ± 0.48%    54.68% ± 0.49%

Figure 2: Recall@1 results for our $F$-statistic loss and four competitors across three data sets. Shown is the percentage correct classification and the standard error of the mean. The best algorithm(s) on a given data set are highlighted.

Figure 2 reports recall@1 accuracy. Overall, the $F$-statistic loss achieves accuracy comparable to the best of its competitors, histogram and binomial deviance losses. It obtains the best result on CUHK03, ties on Market-1501, and is a tier below the best on CUB-200. In earlier work (Anonymized Citation, 2018), we conducted a battery of empirical tests comparing deep metric learning and few-shot learning methods, and the histogram loss appears to be the most robust. 
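The recall@1 evaluation described above reduces to a nearest-neighbor lookup. The following is an illustrative sketch under our own naming, using L2 distance as done for the $F$-statistic loss:

```python
import numpy as np

def recall_at_1(queries, query_labels, refs, ref_labels):
    """Fraction of queries whose nearest reference (L2) shares their class."""
    queries, refs = np.asarray(queries, float), np.asarray(refs, float)
    # Pairwise L2 distances between every query and every reference
    dists = np.linalg.norm(queries[:, None, :] - refs[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)               # index of closest reference
    hits = np.asarray(ref_labels)[nearest] == np.asarray(query_labels)
    return float(hits.mean())
```

In the $q$-shot framing, `refs` holds the $q$ labeled examples per held-out class and `queries` holds the instances to be classified.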
Here, we have demonstrated\nthat our F -statistic loss matches this state-of-the-art in terms of producing domain embeddings that\ncluster instances by class. In the remainder of the paper, we argue that the F -statistic loss obtains\nsuperior disentangled embeddings.\n\n3 Quantifying disentanglement\n\nDisentangling is based on the premise that a set of underlying factors are responsible for generating\nobserved instances. The instances are typically high dimensional, redundant, and noisy, and each\nvector element depends on the value of multiple factors. The goal of a disentangling procedure is\nto recover the causal factors of an instance in a code vector. The term code is synonymous with\nembedding, but we prefer \u2018code\u2019 in this section to emphasize our focus on disentangling.\nThe notion of what constitutes an ideal code is somewhat up for debate, with most authors preferring\nto avoid explicit de\ufb01nitions, and others having con\ufb02icting notions (Higgins et al., 2017; Kim & Mnih,\n2017). The most explicit and comprehensive de\ufb01nition of disentangling (Eastwood & Williams,\n2018) is based on three criteria, which we refer to\u2014using a slight variant of their terminology\u2014as\nmodularity, compactness, and explicitness.2 In a modular representation, each dimension of the code\nconveys information about at most one factor. In a compact representation, a given factor is associated\nwith only one or a few code dimensions. In an explicit representation, there is a simple (e.g., linear)\nmapping from the code to the value of a factor. (See Supplementary Materials for further detail.)\nResearchers who have previously attempted to quantify disentangling have considered different\nsubsets of the modularity, compactness, and explicitness criteria. In Eastwood & Williams (2018),\nall three are included; in Kim & Mnih (2017), modularity and compactness are included, but\nnot explicitness; and in Higgins et al. 
(2017), modularity is included, but not compactness or\nexplicitness. We argue that modularity and explicitness should be considered as de\ufb01ning features of\ndisentangled representations, but not compactness. Although compactness facilitates interpretation\nof the representations, it has two signi\ufb01cant drawbacks. First, forcing compactness can affect the\nrepresentation\u2019s utility. Consider a factor \u03b8 \u2208 [0\u25e6, 360\u25e6] that determines the orientation of an object in\nan image. Encoding the orientation in two dimensions as (sin \u03b8, cos \u03b8) captures the natural similarity\nstructure of orientations, yet it is not compact relative to using \u03b8 as the code. Second, forcing a\nneural network to discover a minimal (compact) code may lead to local optima in training because\nthe solution space is highly constrained; allowing redundancy in the code enables many equivalent\nsolutions.\n\n2We developed our disentangling criteria and terminology in parallel with and independently of Eastwood\n& Williams (2018). We prefer our nomenclature and also our quanti\ufb01cation of the criteria because their\nquanti\ufb01cation requires determination of two hyperparameters (an L1 regularization penalty and a tree depth for a\nrandom forest). Nonetheless, it is encouraging that multiple research groups are converging on essentially the\nsame criteria.\n\n5\n\n\fIn order to evaluate disentangling performance of a deep-embedding procedure, we quantify modular-\nity and explicitness. For modularity, we start by estimating the mutual information between each\ncode dimension and each factor.3 If code dimension i is ideally modular, it will have high mutual\ninformation with a single factor and zero mutual information with all other factors. We use the\ndeviation from this idealized case to compute a modularity score. 
Given a single code dimension $i$ and a factor $f$, we denote the mutual information between the code and factor by $m_{if}$, $m_{if} \geq 0$. We create a "template" vector $t_i$ of the same size as $m_i$, which represents the best-matching case of ideal modularity for code dimension $i$:

$$ t_{if} = \begin{cases} \theta_i & \text{if } f = \arg\max_g(m_{ig}) \\ 0 & \text{otherwise,} \end{cases} $$

where $\theta_i = \max_g(m_{ig})$. The observed deviation from the template is given by

$$ \delta_i = \frac{\sum_f (m_{if} - t_{if})^2}{\theta_i^2 (N - 1)}, \qquad (2) $$

where $N$ is the number of factors. A deviation of 0 indicates that we have achieved perfect modularity and 1 indicates that this dimension has equal mutual information with every factor. Thus, we use $1 - \delta_i$ as a modularity score for code dimension $i$ and the mean of $1 - \delta_i$ over $i$ as the modularity score for the overall code. Note that this expectation does not tell us if each factor is well represented in the code. To ascertain the coverage of the code, the explicitness measure is needed.

Under the assumption that factors have discrete values, we can compute an explicitness score for each value of each factor. In an explicit representation, recovering factor values from the code should be possible with a simple classifier. We have experimented with both RBF networks and logistic regression as recovery models, and have found that logistic regression, with its implied linear separability, is a more robust procedure. We thus fit a one-versus-rest logistic-regression classifier that takes the entire code as input. We record the ROC area-under-the-curve (AUC) of that classifier. 
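The modularity score of Equation 2 can be sketched as follows, given an estimated mutual-information matrix. The function name and the shape convention (rows index code dimensions, columns index factors) are our assumptions; dimensions with zero mutual information for every factor would need special handling to avoid division by zero:

```python
import numpy as np

def modularity_score(mi):
    """Mean of 1 - delta_i (Equation 2) over code dimensions.

    mi: (D, N) matrix, mi[i, f] = mutual information between code
    dimension i and factor f (entries nonnegative, theta_i > 0).
    """
    mi = np.asarray(mi, float)
    n_factors = mi.shape[1]
    theta = mi.max(axis=1)                       # MI with best-matched factor
    template = np.zeros_like(mi)                 # ideal template t_i
    template[np.arange(mi.shape[0]), mi.argmax(axis=1)] = theta
    delta = ((mi - template) ** 2).sum(axis=1) / (theta ** 2 * (n_factors - 1))
    return float((1.0 - delta).mean())
```

A perfectly modular code (each dimension informative about exactly one factor) scores 1; a dimension equally informative about every factor scores 0.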
We quantify\nthe explicitness of a code using the mean of AUCjk over j, a factor index, and k, an index on values\nof factor j.\nIn the next section, we use this quanti\ufb01cation of modularity and explicitness to evaluate our F -statistic\nloss against other disentangling and deep-embedding methods.\n\n4 A weakly supervised approach to disentanglement\n\nPreviously proposed disentangling procedures lie at one of two extremes of supervision: either entirely\nunsupervised (Chen et al., 2016; Higgins et al., 2017), or requiring factor-aware oracles\u2014oracles that\nname a particular factor and provide sets of instances that either differ on all factors except the named\nfactor (Kulkarni et al., 2015) or are ordered by factor-speci\ufb01c similarity (Karaletsos et al., 2015; Veit\net al., 2016). The unsupervised procedures suffer from being underconstrained; the oracle-based\nprocedures require strong supervision.\nWe propose an oracle-based training procedure with an intermediate degree of supervision, inspired\nby the deep-embedding literature. We consider an oracle which chooses a factor and a set of instances,\nthen sorts the instances by their similarity on that factor, or into two groups\u2014identical and non-\nidentical. The oracle conveys the similarities but not the name of the factor itself. This scenario is\nlike the Sesame Street (children\u2019s TV show) game in which a set of objects are presented and one is\nnot like the other, and the child needs to determine along what dimension it differs. Sets of instances\nsegmented in this manner are easy to obtain via crowdsourcing: a worker is given a set of instances\nand simply told to sort them into two groups by similarity to one another, or to sort them by similarity\nto a reference. In either case, the sorting dimension is never explicitly speci\ufb01ed, and any nontrivial\ndomain will have many dimensions (factors) from which to choose. 
Our unnamed-factor oracle is a\ngeneralization of the procedure used for training deep embeddings, where the oracle judges similarity\nof instances by class label, without reference to the speci\ufb01c class label. Instead, our unnamed-factor\noracle operates by choosing a factor randomly and specifying similarity of instances by factor label,\nwithout reference to the speci\ufb01c factor.\n\n3In this work, we focus on the case of factors with discrete values and codes with continuous values. We\ndiscretize the code by constructing a 20-bin histogram of the code values with equal width bins, and then\ncomputing discrete mutual information between the factor-values and the code histogram.\n\n6\n\n\fWe explore two datasets in which each instance is tagged with values for several statistically inde-\npendent factors. Some of the factors are treated as class-related, and some as noise. First, we train\non a data set of video game sprites\u201460 \u00d7 60 pixel color images of game characters viewed from\nvarious angles and in a variety of poses (Reed et al., 2015). The identity of the game characters\nis composed of 7 factors\u2014body, arms, hair, gender, armor, greaves, and weapon\u2014each with 2\u20135\ndistinct values, leading to 672 total unique identities which can be instantiated in various viewing\nangles and poses. We also explore the small NORB dataset (LeCun et al., 2004). This dataset is\ncomposed of 96 \u00d7 96 pixel grayscale images of toys in various poses and lighting conditions. There\nare 5 superordinate categories, each with 10 subordinate categories, a total of 50 types of toys. Each\ntoy is imaged from 9 camera elevations and 18 azimuths, and under 6 lighting conditions. For our\nexperiments, we de\ufb01ne factors for toy type, elevation, and azimuth, and we treat lighting condition\nas a noise variable. 
For simplicity of evaluation, we partition the values of elevation and azimuth to\ncreate binary factors: grouping elevation into low (0 through 4) and high (5 through 8) buckets and\nazimuth values into right- (0 through 16) and left-(18 through 34) facing buckets, leading to a total of\n200 unique identities.\n\n4.1 Training Details\nFor the sprites dataset, we used the encoder architecture of Reed et al. (2015) as well as their embed-\nding dimensionality of 22. For small NORB, we use a convolutional network with 3 convolutional\nlayers and a \ufb01nal fully connected layer with an embedding dimensionality of 20. For the convolutional\nlayers, the \ufb01lter sizes are (7 \u00d7 7, 3 \u00d7 3, 3 \u00d7 3), the \ufb01lter counts are (48, 64, 72), and all use a stride\nof 2 and ReLU activation. For the F -statistic loss, we set the number of training dimensions d = 2.\nAgain, all nets were trained using the ADAM optimizer, with the same learning rates as used for the\nclassi\ufb01cation datasets.\nWe construct minibatches in a manner analogous to how we did for deep embeddings with class-based\ntraining (Section 3). For factor-based training, we select instances with similarity determined by\na single factor to construct a minibatch. For each epoch, we iterate through the factors until we\nhave trained on every instance with respect to every factor. Each minibatch is composed of up to 12\nfactor-values. For example, a minibatch focusing on the hair color factor of the sprites dataset will\ninclude samples of up to 12 hair colors, with multiple instances within each hair color. We train with\nup to 10 instances per factor-value for triplet and histogram. For the F -statistic loss, we found that\ntraining with up to 5 instances per factor-value helps avoid under\ufb01tting.\nFor both datasets, we evaluated with \ufb01ve-fold cross validation, using the conjunction of factors to\nsplit: the 7 factors for sprites and 3 (toy type, azimuth, and elevation) for norb. 
For each split, the\nvalidation set was used to determine when to stop training, based on mean factor explicitness. The\n\ufb01rst split was used to tune hyper-parameters, and the test sets of the remaining four splits are used to\nreport results. For these experiments, we compare the F -statistic loss to the triplet and histogram\nlosses; other losses using Lp norm or cosine distances should yield similar results. We also compare\nto the \u03b2-variational auto-encoder, or \u03b2-VAE (Higgins et al., 2017), an unsupervised disentangling\nmethod that has been shown to outperform other unsupervised methods such as InfoGAN (Chen\net al., 2016). The generator net in the \u03b2-VAE has the same number of layers as the encoder. The\nnumber of \ufb01lters and the size of the receptive \ufb01eld in the generator are mirror values of the encoder,\nsuch that the \ufb01rst layer in the encoder has the same number of output \ufb01lters that the last layer in the\ngenerator has as input. For the \u03b2-VAE, training proceeds until the reconstruction likelihood on the\nheld-out validation set stops improving.\n\n4.2 Results\nFigure 3 shows the modularity and explicitness scores for representations learned on the sprites and\nsmall NORB datasets (\ufb01rst and second rows, respectively) using triplet, histogram, and F -statistic\nlosses. Modularity scores appear in the \ufb01rst column; for modularity, we report the mean across\nvalidation splits and embedding dimensions. Explicitness scores appear in the second column; for\nexplicitness, we report the mean across validation splits and factor-values. (The sprites dataset has 7\nfactors and 22 total factor-values. The small NORB has a total of 3 factors and 54-factor values.) The\nF -statistic loss achieves the best modularity on both datasets, and the best explicitness on the small\nNORB dataset. 
On the Sprites dataset, all of the methods achieve good explicitness.

Figure 3: Mean modularity and explicitness scores for the triplet, histogram, and F-statistic losses on the small NORB and Sprites datasets. The F-statistic loss dominates the other methods in three of the comparisons, and although the F-statistic loss has a slight numerical advantage in Sprites explicitness, the advantage is not statistically reliable (comparing histogram to F-statistic with a paired t-test, p > .20). Essentially, all methods are at ceiling in Sprites explicitness. Black bars indicate ± one standard error of the mean.

Figure 4: Mean modularity and explicitness scores for the F-statistic loss and β-VAE on the Sprites dataset. Black bars indicate ± one standard error of the mean. The light and dark green bars correspond to the F-statistic loss trained with an unnamed-factor oracle and a class-aware oracle, respectively. See text for details.

Figure 4 compares modularity and explicitness of representations for the F-statistic loss and the β-VAE, for various settings of β. The default setting of β = 1 corresponds to the original VAE (Kingma & Welling, 2013). As β increases, modularity improves but explicitness worsens. This trade-off has not been previously reported and points to a limitation of the method. The first bar of each figure corresponds to the F-statistic loss trained with the unnamed-factor oracle, and the second bar corresponds to the F-statistic loss trained with a class-aware oracle. The class-aware oracle defines a class as a unique conjunction of the component factors (e.g., for small NORB, the conjunction of object identity, azimuth, and elevation).
It is thus a weaker form of supervision than the unnamed-factor oracle provides, and is analogous to the type of training performed with deep-embedding procedures, where the oracle indicates whether or not instances match on class without naming the class or its component factors. Both F-statistic representations are superior to all variants of the β-VAE. The comparison is not exactly fair because the β-VAE is unsupervised whereas the F-statistic loss is weakly supervised. Nonetheless, the β-VAE is a critical model for comparison, and we would have been remiss not to include it.

5 Discussion and future work

The F-statistic loss is motivated by the goal of unifying the deep-embedding and disentangling literatures. We have shown that it achieves state-of-the-art performance on the recall@1 task used to evaluate deep embeddings when trained with a class-aware oracle, and achieves state-of-the-art performance in disentangling when trained with an unnamed-factor oracle. The ultimate goal of research in disentangling is to develop methods that work in a purely unsupervised fashion. The β-VAE is the leading contender in this regard, but we have shown a troubling trade-off obtained with the β-VAE through our quantification of modularity and explicitness (Figure 4), and we have shown that unsupervised training cannot at present compete with even weakly supervised training (not a surprise to anyone). Another contribution of our work to disentangling is the notion of training with an unnamed-factor oracle or a class-aware oracle; in previous research on supervised disentangling, the stronger factor-aware oracle was used, which would indicate a factor name as well as judging similarity in terms of that factor.
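The supervision hierarchy described here (factor-aware, then unnamed-factor, then class-aware) can be summarized with a sketch. These oracle functions are purely illustrative, assuming each instance carries a dict of factor labels; the names are ours, not the paper's.

```python
def factor_aware_oracle(labels_a, labels_b, factor):
    """Strongest: the query names a factor, and the oracle judges
    similarity with respect to that named factor."""
    return labels_a[factor] == labels_b[factor]

def unnamed_factor_oracle(labels_a, labels_b, hidden_factor):
    """Weaker: the oracle judges similarity on whichever single factor is
    currently in play; the learner sees only the boolean, never the
    factor's name."""
    return labels_a[hidden_factor] == labels_b[hidden_factor]

def class_aware_oracle(labels_a, labels_b):
    """Weakest: a class is the full conjunction of component factors; the
    oracle reports only whether the two instances share a class."""
    return labels_a == labels_b
```

Note that the first two are identical computations; the distinction is what information reaches the learner, which is exactly why the class-aware setting is the hardest to disentangle from.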
Our goal is to explore increasingly weaker forms of supervision. We have taken the largest step so far in this regard through our examination of disentangling with a class-aware oracle (Figure 4), which should serve as a reference for others interested in disentangling.

Our current research focuses on methods for adaptively estimating d, the hyper-parameter governing the number of dimensions trained on any trial. Presently, d determines the loss behavior for all pairs of classes, and must be tuned for each data set. Our hope is that we can adaptively estimate d for each pair of identities on the fly.

6 Acknowledgements

This research was supported by the National Science Foundation awards EHR-1631428 and SES-1461535.

References

Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.

Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 349–356, 2005.

Eastwood, Cian and Williams, Chris. A framework for the quantitative evaluation of disentangled representations.
ICLR, 2018.

Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, and Lerchner, Alexander. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.

Karaletsos, Theofanis, Belongie, Serge, and Rätsch, Gunnar. Bayesian representation learning with oracle constraints. ICLR, pp. 1–9, 2015. URL http://arxiv.org/abs/1506.05011.

Kim, Hyunjik and Mnih, Andriy. Disentangling by factorising. In Learning Disentangled Representations: From Perception to Control Workshop, NIPS, 2017.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint, 2014. URL http://arxiv.org/abs/1412.6980.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

Kulkarni, Tejas D, Whitney, William F, Kohli, Pushmeet, and Tenenbaum, Josh. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pp. 2539–2547, 2015.

LeCun, Yann, Huang, Fu Jie, and Bottou, Leon. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pp. II–104. IEEE, 2004.

Li, Wei, Zhao, Rui, Xiao, Tong, and Wang, Xiaogang. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 152–159, 2014.

Reed, Scott, Sohn, Kihyuk, Zhang, Yuting, and Lee, Honglak. Learning to disentangle factors of variation with manifold interaction.
Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1431–1439, 2014.

Reed, Scott E., Zhang, Yi, Zhang, Yuting, and Lee, Honglak. Deep visual analogy-making. Advances in Neural Information Processing Systems, pp. 1252–1260, 2015. ISSN 10495258. URL http://papers.nips.cc/paper/5845-deep-visual-analogy-making.

Schroff, Florian, Kalenichenko, Dmitry, and Philbin, James. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

Snell, Jake, Swersky, Kevin, and Zemel, Richard. Prototypical networks for few-shot learning. In Luxburg, U. V., Guyon, I., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. xxx–xxx. Curran Associates, Inc., 2017.

Song, Hyun Oh, Jegelka, Stefanie, and Savarese, Silvio. Deep metric learning via lifted structured feature embedding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012, 2016.

Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, and Wojna, Zbigniew. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Ustinova, Evgeniya and Lempitsky, Victor. Learning deep embeddings with histogram loss. Advances in Neural Information Processing Systems, pp. 4170–4178, 2016.

Veit, Andreas, Belongie, Serge, and Karaletsos, Theofanis. Disentangling nonlinear perceptual embeddings with multi-query triplet networks. arXiv preprint, 2016. URL http://arxiv.org/abs/1603.07810.

Vinyals, Oriol, Blundell, Charles, Lillicrap, Tim, Wierstra, Daan, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp.
3630–3638, 2016.

Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

Yi, Dong, Lei, Zhen, and Li, Stan Z. Deep metric learning for practical person re-identification. ICPR, 11(4):1–11, 2014a. ISSN 10514651. doi: 10.1109/ICPR.2014.16. URL http://arxiv.org/abs/1407.4979.

Yi, Dong, Lei, Zhen, Liao, Shengcai, and Li, Stan Z. Deep metric learning for person re-identification. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pp. 34–39. IEEE, 2014b.

Zheng, Liang, Shen, Liyue, Tian, Lu, Wang, Shengjin, Wang, Jingdong, and Tian, Qi. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124, 2015.