{"title": "Discriminative Clustering by Regularized Information Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 775, "page_last": 783, "abstract": "Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method.", "full_text": "Discriminative Clustering by Regularized\n\nInformation Maximization\n\nRyan Gomes\n\ngomes@vision.caltech.edu\n\nAndreas Krause\n\nkrausea@caltech.edu\n\nperona@vision.caltech.edu\n\nCalifornia Institute of Technology\n\nPietro Perona\n\nPasadena, CA 91106\n\nAbstract\n\nIs there a principled way to learn a probabilistic discriminative classi\ufb01er from an\nunlabeled data set? We present a framework that simultaneously clusters the data\nand trains a discriminative classi\ufb01er. We call it Regularized Information Maxi-\nmization (RIM). RIM optimizes an intuitive information-theoretic objective func-\ntion which balances class separation, class balance and classi\ufb01er complexity. The\napproach can \ufb02exibly incorporate different likelihood functions, express prior as-\nsumptions about the relative size of different classes and incorporate partial labels\nfor semi-supervised learning. 
In particular, we instantiate the framework to un-\nsupervised, multi-class kernelized logistic regression. Our empirical evaluation\nindicates that RIM outperforms existing methods on several real data sets, and\ndemonstrates that RIM is an effective model selection method.\n\nIntroduction\n\n1\nClustering algorithms group data items into categories without requiring human supervision or def-\ninition of categories. They are often the \ufb01rst tool used when exploring new data. A great number\nof clustering principles have been proposed, most of which can be described as either generative\nor discriminative in nature. Generative clustering algorithms provide constructive de\ufb01nitions of\ncategories in terms of their geometric properties in a feature space or as statistical processes for\ngenerating data. Examples include k-means and Gaussian mixture model clustering. In order for\ngenerative clustering to be practical, restrictive assumptions must be made about the underlying\ncategory de\ufb01nitions.\nRather than modeling categories explicitly, discriminative clustering techniques represent the\nboundaries or distinctions between categories. Fewer assumptions about the nature of categories\nare made, making these methods powerful and \ufb02exible in real world applications. Spectral graph\npartitioning [1] and maximum margin clustering [2] are example discriminative clustering methods.\nA disadvantage of existing discriminative approaches is that they lack a probabilistic foundation,\nmaking them potentially unsuitable in applications that require reasoning under uncertainty or in\ndata exploration.\nWe propose a principled probabilistic approach to discriminative clustering, by formalizing the\nproblem as unsupervised learning of a conditional probabilistic model. We generalize the work of\nGrandvalet and Bengio [3] and Bridle et al. 
[4] in order to learn probabilistic classifiers that are appropriate for multi-class discriminative clustering, as explained in Section 2. We identify two fundamental, competing quantities, class balance and class separation, and develop an information theoretic objective function which trades off these quantities. Our approach corresponds to maximizing mutual information between the empirical distribution on the inputs and the induced label distribution, regularized by a complexity penalty. Thus, we call our approach Regularized Information Maximization (RIM).
In summary, our contribution is RIM, a probabilistic framework for discriminative clustering with a number of attractive properties. Thanks to its probabilistic formulation, RIM is flexible: it is compatible with diverse likelihood functions and allows specification of prior assumptions about expected class proportions. We show how our approach leads to an efficient, scalable optimization procedure that also provides a means of automatic model selection (determination of the number of clusters). RIM is easily extended to semi-supervised classification. Finally, we show that RIM performs better than competing approaches on several real-world data sets.
2 Regularized Information Maximization
Suppose we are given an unlabeled dataset of N feature vectors (datapoints) X = (x_1, ..., x_N), where x_i = (x_i1, ..., x_iD)^T ∈ R^D are D-dimensional vectors with components x_id. Our goal is to learn a conditional model p(y|x, W) with parameters W which predicts a distribution over label values y ∈ {1, ..., K} given an input vector x.
Our approach is to construct a functional F(p(y|x, W); X, λ) which evaluates the suitability of p(y|x, W) as a discriminative clustering model.
We then use standard discriminative classifiers such as logistic regression for p(y|x, W), and maximize the resulting function F(W; X, λ) over the parameters W; λ is an additional tuning parameter that is fixed during optimization.
We are guided by three principles when constructing F(p(y|x, W); X, λ). The first is that the discriminative model's decision boundaries should not be located in regions of the input space that are densely populated with datapoints. This is often termed the cluster assumption [5], and also corresponds to the idea that datapoints should be classified with large margin. Grandvalet & Bengio [3] show that a conditional entropy term −(1/N) Σ_i H{p(y|x_i, W)} very effectively captures the cluster assumption when training probabilistic classifiers with partial labels. However, in the case of fully unsupervised learning this term alone is not enough to ensure sensible solutions, because conditional entropy may be reduced by simply removing decision boundaries, and unlabeled categories tend to be removed. We illustrate this in Figure 1 (left) with an example using the multilogit regression classifier as the conditional model p(y|x, W), which we will develop in Section 3.
In order to avoid degenerate solutions, we incorporate the notion of class balance: we prefer configurations in which category labels are assigned evenly across the dataset. We define the empirical label distribution

p̂(y; W) = ∫ p̂(x) p(y|x, W) dx = (1/N) Σ_i p(y|x_i, W),

which is an estimate of the marginal distribution of y. A natural way to encode our preference towards class balance is to use the entropy H{p̂(y; W)}, because it is maximized when the labels are uniformly distributed.
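To make these two quantities concrete, the empirical label distribution and the two entropy terms can be computed from a matrix of predicted class probabilities in a few lines of NumPy. This is a minimal sketch of our own; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def empirical_label_stats(P):
    """P is an N x K matrix with P[i, k] = p(y = k | x_i, W).

    Returns the empirical label distribution p_hat(y; W), its entropy
    H{p_hat} (the class balance term), and the average conditional
    entropy (1/N) sum_i H{p(y | x_i, W)} (the class separation term).
    """
    eps = 1e-12                        # guard against log(0)
    p_hat = P.mean(axis=0)             # p_hat(y = k; W) = (1/N) sum_i p(y = k | x_i, W)
    H_marginal = -np.sum(p_hat * np.log(p_hat + eps))
    H_conditional = -np.mean(np.sum(P * np.log(P + eps), axis=1))
    return p_hat, H_marginal, H_conditional

# A confident, perfectly balanced labeling of four points into two classes:
# marginal entropy is maximal (log 2) while conditional entropy is zero.
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
p_hat, H_m, H_c = empirical_label_stats(P)
```

A confident, balanced labeling such as the toy P above maximizes the marginal entropy while driving the average conditional entropy to zero, which is exactly the trade-off described in the text.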
Combining the two terms, we arrive at

I_W{y; x} = H{p̂(y; W)} − (1/N) Σ_i H{p(y|x_i, W)},    (1)

which is the empirical estimate of the mutual information between x and y under the conditional model p(y|x, W).
Bridle et al. [4] were the first to propose maximizing I_W{y; x} in order to learn probabilistic classifiers without supervision. However, they note that I_W{y; x} may be trivially maximized by a conditional model that classifies each data point x_i into its own category y_i, and that classifiers trained with this objective tend to fragment the data into a large number of categories; see Figure 1 (center). We therefore introduce a regularizing term R(W; λ) whose form will depend on the specific choice of p(y|x, W). This term penalizes conditional models with complex decision boundaries in order to yield sensible clustering solutions. Our objective function is

F(W; X, λ) = I_W{y; x} − R(W; λ)    (2)

and we therefore refer to our approach as Regularized Information Maximization (RIM); see Figure 1 (right). While we motivated this objective with notions of class balance and separation, our approach may be interpreted as learning a conditional distribution for y that preserves information from the data set, subject to a complexity penalty.

Figure 1 (columns: Grandvalet & Bengio [3]; Bridle et al. [4]; RIM. Rows: decision regions; conditional entropy): Example unsupervised multilogit regression solutions on a simple dataset with three clusters. The top and bottom rows show the category label arg max_y p(y|x, W) and conditional entropy H{p(y|x, W)} at each point x, respectively.
We find that both class balance and regularization terms are necessary to learn unsupervised classifiers suitable for multi-class clustering.

3 Example Application: Unsupervised Multilogit Regression
The RIM framework is flexible in the choice of p(y|x; W) and R(W; λ). As an example instantiation, we here choose multiclass logistic regression as the conditional model. Specifically, if K is the maximum number of classes, we choose

p(y = k|x, W) ∝ exp(w_k^T x + b_k) and R(W; λ) = λ Σ_k w_k^T w_k,    (3)

where the set of parameters W = {w_1, ..., w_K; b_1, ..., b_K} consists of weight vectors w_k and bias values b_k for each class k. Each weight vector w_k ∈ R^D is D-dimensional with components w_kd. The regularizer is the squared L2 norm of the weight vectors, and may be interpreted as an isotropic normal distribution prior on the weights W. The bias terms are not penalized.
In order to optimize Eq. 2 specialized with Eq. 3, we require the gradients of the objective function. For clarity, we define p_ki ≡ p(y = k|x_i, W) and p̂_k ≡ p̂(y = k; W). The partial derivatives are

∂F/∂w_kd = (1/N) Σ_ic (∂p_ci/∂w_kd) log(p_ci/p̂_c) − 2λ w_kd and ∂F/∂b_k = (1/N) Σ_ic (∂p_ci/∂b_k) log(p_ci/p̂_c).    (4)

Naive computation of the gradient requires O(NK²D) operations, since there are K(D + 1) parameters and each derivative requires a sum over NK terms. However, the conditional probability derivatives for multilogit regression have the form

∂p_ci/∂w_kd = (δ_kc − p_ci) p_ki x_id and ∂p_ci/∂b_k = (δ_kc − p_ci) p_ki,

where δ_kc is equal to one when indices k and c are equal, and zero otherwise. When these expressions are substituted into Eq.
4, we find the following expressions:

∂F/∂w_kd = (1/N) Σ_i x_id p_ki ( log(p_ki/p̂_k) − Σ_c p_ci log(p_ci/p̂_c) ) − 2λ w_kd
∂F/∂b_k = (1/N) Σ_i p_ki ( log(p_ki/p̂_k) − Σ_c p_ci log(p_ci/p̂_c) )    (5)

Computing the gradient requires only O(NKD) operations, since the terms Σ_c p_ci log(p_ci/p̂_c) may be computed once and reused in each partial derivative expression.
The above gradients are used in the L-BFGS [6] quasi-Newton optimization algorithm (we used Mark Schmidt's implementation at http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html). We find empirically that the optimization usually converges within a few hundred iterations. When specialized to multilogit regression, the objective function F(W; X, λ) is non-concave. Therefore the algorithm can only be guaranteed to halt at locally optimal stationary points of F. In Section 3.1, we explain how we can obtain an initialization that is robust against local optima.

[Figure 1 plot panels omitted: six x1-x2 scatter plots with colorbars.]

Figure 2: Demonstration of model selection on the toy problem from Figure 1. The algorithm is initialized with 50 category weight vectors w_k. Upon convergence, only three of the categories are populated with data examples. The negative bias terms of the unpopulated categories drive the unpopulated class probabilities p̂_k towards zero. The corresponding weight vectors w_k have norms near zero.

3.1 Model Selection
Setting the derivatives (Eq.
5) equal to zero yields the following condition at stationary points of F:

w_k = Σ_i α⁰_ki x_i    (6)

where we have defined

α⁰_ki ≡ (1/(2λN)) p_ki ( log(p_ki/p̂_k) − Σ_c p_ci log(p_ci/p̂_c) ).    (7)

The L2 regularizing function R(W; λ) in Eq. 3 is additively composed of penalty terms associated with each category: w_k^T w_k = Σ_ij α⁰_ki α⁰_kj x_i^T x_j. It is instructive to observe the limiting behavior of the penalty term w_k^T w_k when datapoints are not assigned to category k; that is, when p̂_k = (1/N) Σ_i p_ki → 0. This implies that p_ki → 0 for all i, and therefore α⁰_ki → 0 for all i. Finally, w_k^T w_k = Σ_ij α⁰_ki α⁰_kj x_i^T x_j → 0. This means that the regularizing function does not penalize unpopulated categories.
We find empirically that when we initialize with a large number of category weights w_k, many decay away depending on the value of λ. Typically as λ increases, fewer categories are discovered. This may be viewed as model selection (automatic determination of the number of categories) since the regularizing function and parameter λ may be interpreted as a form of prior on the weight parameters. The bias terms b_k are unpenalized and are adjusted during optimization to drive the class probabilities p̂_k arbitrarily close to zero for unpopulated classes. This is illustrated in Figure 2.
This behavior suggests an effective initialization procedure for our algorithm. We first oversegment the data into a large number of clusters (using k-means or another suitable algorithm) and train a supervised multilogit classifier using these cluster labels. (This initial classifier may be trained with a small number of L-BFGS iterations since it only serves as a starting point.)
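Putting Eqs. 2, 3 and 5 together, the linear RIM objective and its exact gradient fit in a short NumPy/SciPy sketch. This is our illustrative reconstruction, not the authors' code: it uses SciPy's L-BFGS-B in place of minFunc, and for brevity it starts from a small random initialization rather than the k-means warm start described above:

```python
import numpy as np
from scipy.optimize import minimize

def rim_objective(theta, X, K, lam):
    """Returns (-F, -gradient) for Eq. 2 with Eqs. 3 and 5, for a minimizer."""
    N, D = X.shape
    W = theta[:K * D].reshape(K, D)
    b = theta[K * D:]
    A = X @ W.T + b                              # N x K logits w_k^T x_i + b_k
    A -= A.max(axis=1, keepdims=True)            # numerical stability
    P = np.exp(A)
    P /= P.sum(axis=1, keepdims=True)            # p_ki = p(y = k | x_i, W)
    eps = 1e-12
    p_hat = P.mean(axis=0)                       # empirical label distribution
    F = (-np.sum(p_hat * np.log(p_hat + eps))            # H{p_hat}
         + np.mean(np.sum(P * np.log(P + eps), axis=1))  # -(1/N) sum_i H{p(y|x_i)}
         - lam * np.sum(W * W))                          # R(W; lambda)
    L = np.log(P + eps) - np.log(p_hat + eps)    # log(p_ki / p_hat_k)
    G = P * (L - np.sum(P * L, axis=1, keepdims=True))   # bracketed term of Eq. 5
    gW = G.T @ X / N - 2.0 * lam * W
    gb = G.mean(axis=0)
    return -F, -np.concatenate([gW.ravel(), gb])

# Toy run: two well-separated blobs, more categories (K = 5) than clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (40, 2)), rng.normal(2, 0.3, (40, 2))])
K, lam = 5, 1.0 / len(X)
theta0 = 0.01 * rng.normal(size=K * (X.shape[1] + 1))
res = minimize(rim_objective, theta0, args=(X, K, lam),
               jac=True, method="L-BFGS-B")
W, b = res.x[:K * 2].reshape(K, 2), res.x[K * 2:]
labels = np.argmax(X @ W.T + b, axis=1)
```

Sweeping `lam` upward and counting the populated values in `labels` reproduces the model selection behavior described in Section 3.1.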
We then use this classifier as the starting point for our RIM algorithm and optimize with different values of λ in order to obtain solutions with different numbers of clusters.

4 Example Application: Unsupervised Kernel Multilogit Regression
The stationary conditions have another interesting consequence. Equation 6 indicates that at stationary points, the weights are located in the span of the input datapoints. We use this insight as justification to define explicit coefficients α_ki and enforce the constraint w_k = Σ_i α_ki x_i during optimization. Substituting this equation into the multilogit regression conditional likelihood allows replacement of all inner products w_k^T x with Σ_i α_ki K(x_i, x), where K is a positive definite kernel function that evaluates the inner product x_i^T x. The conditional model now has the form

p(y = k|x, α, b) ∝ exp( Σ_i α_ki K(x_i, x) + b_k ).

[Figure 2 plot panels omitted: class probabilities p̂_k, biases b_k, and weight vector norms w_k^T w_k plotted against class index.]

Substituting the constraint into the regularizing function Σ_k w_k^T w_k yields a natural replacement of w_k^T w_k by the Reproducing Kernel Hilbert Space (RKHS) norm of the function Σ_i α_ki K(x_i, ·):

R(α; λ) = λ Σ_k Σ_ij α_ki α_kj K(x_i, x_j).    (8)

We use the L-BFGS algorithm to optimize the kernelized algorithm over the coefficients α_ki and biases b_k. The partial derivatives for the kernel coefficients are

∂F/∂α_kj = (1/N) Σ_i K(x_j, x_i) p_ki ( log(p_ki/p̂_k) − Σ_c p_ci log(p_ci/p̂_c) ) − 2λ Σ_i α_ki K(x_j, x_i),

and the derivatives for the biases are unchanged. The gradient of the kernelized algorithm requires O(KN²) operations to compute.
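As an illustration, the kernelized objective and gradients above can be written against a precomputed kernel matrix. Again this is our own sketch with our own names, returning F and its gradients in the maximization convention:

```python
import numpy as np

def kernel_rim_value_and_grad(alpha, b, Kmat, lam):
    """F and its gradients for kernelized RIM (maximization convention).

    alpha: K x N coefficients, b: length-K biases,
    Kmat:  N x N positive definite kernel matrix with entries K(x_i, x_j).
    """
    N = Kmat.shape[0]
    A = (alpha @ Kmat).T + b                     # logits sum_i alpha_ki K(x_i, x_j) + b_k
    A -= A.max(axis=1, keepdims=True)            # numerical stability
    P = np.exp(A)
    P /= P.sum(axis=1, keepdims=True)
    eps = 1e-12
    p_hat = P.mean(axis=0)
    F = (-np.sum(p_hat * np.log(p_hat + eps))
         + np.mean(np.sum(P * np.log(P + eps), axis=1))
         - lam * np.sum((alpha @ Kmat) * alpha))         # RKHS penalty, Eq. 8
    L = np.log(P + eps) - np.log(p_hat + eps)
    G = P * (L - np.sum(P * L, axis=1, keepdims=True))
    grad_alpha = G.T @ Kmat / N - 2.0 * lam * alpha @ Kmat
    grad_b = G.mean(axis=0)
    return F, grad_alpha, grad_b

# Tiny demo on a random positive definite kernel matrix.
rng = np.random.default_rng(1)
B = rng.normal(size=(12, 3))
Kmat = B @ B.T + np.eye(12)
alpha0 = 0.01 * rng.normal(size=(4, 12))
b0 = np.zeros(4)
F, gA, gb = kernel_rim_value_and_grad(alpha0, b0, Kmat, lam=0.5)
```

To train, flatten (alpha, b), negate F and the gradients, and hand the pair to an L-BFGS routine exactly as in the linear case.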
Kernelized unsupervised multilogit regression exhibits the same model selection behavior as the linear algorithm.
5 Extensions
We now discuss how RIM can be extended to semi-supervised classification, and to encode prior assumptions about class proportions.
5.1 Semi-supervised Classification
In semi-supervised classification, we assume that there are unlabeled examples X_U = {x^U_1, ..., x^U_N} as well as labeled examples X_L = {x^L_1, ..., x^L_M} with labels Y = {y_1, ..., y_M}. We again use mutual information I_W{y; x} (Eq. 1) to define the relationship between unlabeled points and the model parameters, but we incorporate an additional parameter τ which defines the tradeoff between labeled and unlabeled examples. The conditional likelihood is incorporated for labeled examples to yield the semi-supervised objective:

S(W; τ, λ) = τ I_W{y; x} − R(W; λ) + Σ_i log p(y_i|x^L_i, W).

The gradient is computed and again used in the L-BFGS algorithm in order to optimize this combined objective. Our approach is related to the objective in [3], which does not contain the class balance term H{p̂(y; W)}.
5.2 Encoding Prior Beliefs about the Label Distribution
So far, we have motivated our choice of the objective function F through the notion of class balance. However, in many classification tasks, different classes have different numbers of members. In the following, we show how RIM allows flexible expression of prior assumptions about non-uniform class label proportions.
First, note that the following basic identity holds:

H{p̂(y; W)} = log(K) − KL{p̂(y; W)||U},    (9)

where U is the uniform distribution over the set of labels {1, ..., K}.
Substituting the identity, then dropping the constant log(K), yields another interpretation of the objective:

F(W; X, λ) = −(1/N) Σ_i H{p(y|x_i, W)} − KL{p̂(y; W)||U} − R(W; λ).    (10)

The term −KL{p̂(y; W)||U} is maximized when the average label distribution is uniform. We can capture prior beliefs about the average label distribution by substituting a reference distribution D(y; γ) in place of U (γ is a parameter that may be fixed or optimized during learning). [7] also use relative entropy as a means of enforcing prior beliefs, although not with respect to class distributions in multi-class classification problems.
This construction may be used in a clustering task in which we believe that the cluster sizes obey a power law distribution as, for example, considered by [8], who use the Pitman-Yor process for nonparametric language modeling. Simple manipulation yields the following objective:

F(W; X, λ, γ) = I_W{x; y} − H{p̂(y; W)||D(y; γ)} − R(W; λ),

where H{p̂(y; W)||D(y; γ)} is the cross entropy −Σ_k p̂(y = k; W) log D(y = k; γ). We therefore find that label distribution priors may be incorporated using an additional cross entropy regularization term.

Figure 3: Unsupervised Clustering: Adjusted Rand Index (relative to ground truth) versus number of clusters.

6 Experiments
We empirically evaluate our RIM approach on several real data sets, in both fully unsupervised and semi-supervised configurations.
6.1 Unsupervised Learning
Kernelized RIM is initialized according to the procedure outlined in Section 3.1, and run until L-BFGS converges.
Unlabeled examples are then clustered according to arg max_k p(y = k|x, W). We compare RIM against the spectral clustering (SC) algorithm of [1], the fast maximum margin clustering (MMC) algorithm of [9], and kernelized k-means [10]. MMC is a binary clustering algorithm; we use the recursive scheme outlined by [9] to extend the approach to multiple categories. The MMC algorithm requires an initial clustering estimate for initialization, and we use SC to provide this.
We evaluate unsupervised clustering performance in terms of how well the discovered clusters reflect known ground truth labels of the dataset. We report the Adjusted Rand Index (ARI) [11] between an inferred clustering and the ground truth categories. ARI has a maximum value of 1 when two clusterings are identical. We evaluated a number of other measures for comparing clusterings to ground truth, including mutual information, normalized mutual information [12], and cluster impurity [13], and found that the relative rankings of the algorithms were the same as indicated by ARI.
We evaluate the performance of each algorithm while varying the number of clusters that are discovered, and we plot ARI for each setting. For SC and k-means the number of clusters is given as an input parameter. MMC is evaluated at {2, 4, 8, ...} clusters (powers of two, due to the recursive scheme). For RIM, we sweep the regularization parameter λ and allow the algorithm to discover the final number of clusters.
Image Clustering. We test the algorithms on an image clustering task with 350 images from each of four Caltech-256 [14] categories (Faces-Easy, Motorbikes, Airplanes, T-Shirt), for a total of N = 1400 images. We use the Spatial Pyramid Match kernel [15] computed between every pair of images. We sweep RIM's λ parameter across [0.125/N, 4/N]. The results are summarized in Figure 3.
Overall, the clusterings that best match ground truth are given by RIM when it discovers four clusters. We find that RIM outperforms both SC and MMC at all settings. RIM outperforms kernelized k-means when discovering between 4 and 8 clusters; their performances are comparable for other numbers of clusters. Figure 4 shows example images taken from clusters discovered by RIM. Our RIM implementation takes approximately 110 seconds per run on the Caltech Images dataset on a quad core Intel Xeon server. SC requires 38 seconds per run, while MMC requires 44-51 seconds per run depending on the number of clusters specified.
Molecular Graph Clustering. We further test RIM's unsupervised learning performance on two molecular graph datasets. D&D [16] contains N = 1178 protein structure graphs with binary ground truth labels indicating whether or not they function as enzymes. NCI109 [17] is composed of N = 4127 compounds labeled according to whether or not they are active in an anti-cancer screening. We use the subtree kernel developed by [18] with subtree height of 1. For D&D, we sweep RIM's λ parameter through the range [0.001/N, 0.05/N], and for NCI109 we sweep through the interval [0.001/N, 1/N]. Results are summarized in Figures 3 (center and right). We find that of all methods, RIM produces the clusterings that are nearest to ground truth (when discovering 2 clusters for D&D and 5 clusters for NCI109). RIM outperforms both SC and MMC at all settings. RIM has the advantage over k-means when discovering a small number of clusters and is comparable at other settings. On NCI109, RIM required approximately 10 minutes per run. SC required approximately 13 minutes, while MMC required on average 18 minutes per run.

[Figure 3 plot panels omitted: Adjusted Rand Index versus number of clusters for Caltech Images, D&D graphs, and NCI109 graphs, comparing MMC, k-means, RIM, and SC.]

Figure 4: Left: Randomly chosen example images (rows C1-C5) from clusters discovered by unsupervised RIM on Caltech Images. Right: Semi-supervised learning on Caltech Images.

Figure 5: Left: Tetrode dataset average waveform. Right: the waveform with the most uncertain cluster membership according to the classifier learned by RIM.

Neural Tetrode Recordings. We demonstrate RIM on a large scale data set of 319,209 neural activity waveforms recorded from four co-located electrodes implanted in the hippocampus of a behaving rat. The waveforms are composed of 38 samples from each of the four electrodes and are the output of a neural spike detector which aligns signal peaks to the 13th sample; see the average waveform in Figure 5 (left). We concatenate the samples into a single 152-dimensional vector and preprocess by subtracting the mean waveform and dividing each vector component by its variance. We use the linear RIM algorithm given in Section 3, initialized with 100 categories. We set λ to 4/N, and RIM discovers 33 clusters and finishes in 12 minutes. There is no ground truth available for this dataset, but we use it to demonstrate RIM's efficacy as a data exploration tool. Figure 6 shows two clusters discovered by RIM. The top row consists of cluster member waveforms superimposed on each other, with the cluster's mean waveform plotted in red. We find that the clustered waveforms have substantial similarity to each other. Taken as a whole, the clusters give an idea of the typical waveform patterns.
The bottom row shows the learned classifier's discriminative weights w_k for each category, which can be used to gain a sense for how the cluster's members differ from the dataset mean waveform. We can use the probabilistic classifier learned by RIM to discover atypical waveforms by ranking them according to their conditional entropy H{p(y|x_i, W)}. Figure 5 (right) shows the waveform whose cluster membership is most uncertain.

Figure 6: Two clusters discovered by RIM on the Tetrode data set (columns: Cluster 1, Cluster 2). Top row: superimposed waveform members, with cluster mean in red. Bottom row: the discriminative category weights w_k associated with each cluster.

[Figure 4 right panel omitted: test accuracy versus number of labeled examples for RIM, Grandvalet & Bengio, and the supervised baseline.]

6.2 Semi-supervised Classification
We test our semi-supervised classification method described in Section 5.1 against [3] on the Caltech Images dataset. The methods were trained using both unlabeled and labeled examples, and classification performance is assessed on the unlabeled portion. As a baseline, a supervised classifier was trained on labeled subsets of the data and tested on the remainder. Parameters were selected via cross-validation on a subset of the labeled examples. The results are summarized in Figure 4. We find that both semi-supervised methods significantly improve classification performance relative to the supervised baseline when the number of labeled examples is small. Additionally, we find that RIM outperforms Grandvalet & Bengio. This suggests that incorporating prior knowledge about class size distributions (in this case, we use a uniform prior) may be useful in semi-supervised learning.
7 Related Work
Our work has connections to existing work in both unsupervised learning and semi-supervised classification.
Unsupervised Learning.
The information bottleneck method [19] learns a conditional model p(y|x) where the labels y form a lossy representation of the input space x, while preserving information about a third "relevance" variable z. The method maximizes I(y; z) − λ I(x; y), whereas we maximize the information between y and x while constraining complexity with a parametric regularizer. The method of [20] aims to maximize a similarity measure computed between members within the same cluster while penalizing the mutual information between the cluster label y and the input x. Again, mutual information is used to enforce a lossy representation of y|x. Song et al. [22] also view clustering as maximization of the dependence between the input variable and output label variable. They use the Hilbert-Schmidt Independence Criterion as a measure of dependence, whereas we use mutual information.
There is also an unsupervised variant of the Support Vector Machine, called max-margin clustering. Like our approach, the works of [2] and [21] use notions of class balance, separation, and regularization to learn unsupervised discriminative classifiers. However, they are formulated in the max-margin framework rather than our probabilistic approach. Ours appears more amenable to incorporating prior beliefs about the class labels. Unsupervised SVMs are solutions to a convex relaxation of a non-convex problem, while we directly optimize our non-convex objective. The semidefinite programming methods required are much more expensive than our approach.
Semi-supervised Classification. Our semi-supervised objective is related to [3], as discussed in Section 5.1. Another semi-supervised method [23] uses mutual information as a regularizing term to be minimized, in contrast to ours which attempts to maximize mutual information.
The assumption\nunderlying [23] is that any information between the label variable and unlabeled examples is an\nartifact of the classi\ufb01er and should be removed. Our method encodes the opposite assumption:\nthere may be variability (e.g. new class label values) not captured by the labeled data, since it is\nincomplete.\n8 Conclusions\nWe considered the problem of learning a probabilistic discriminative classi\ufb01er from an unlabeled\ndata set. We presented Regularized Information Maximization (RIM), a probabilistic framework\nfor tackling this challenge. Our approach consists of optimizing an intuitive information theoretic\nobjective function that incorporates class separation, class balance and classi\ufb01er complexity, which\nmay be interpreted as maximizing the mutual information between the empirical input and implied\nlabel distributions. The approach is \ufb02exible, in that it allows consideration of different likelihood\nfunctions. It also naturally allows expression of prior assumptions about expected label proportions\nby means of a cross-entropy with respect to a reference distribution. Our framework allows\nnatural incorporation of partial labels for semi-supervised learning. In particular, we instantiate the\nframework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation\nindicates that RIM outperforms existing methods on several real data sets, and demonstrates that\nRIM is an effective model selection method.\nAcknowledgements\nWe thank Alex Smola for helpful comments and discussion, and Thanos Siapas for providing the neural tetrode\ndata. This research was partially supported by NSF grant IIS-0953413, a gift from Microsoft Corporation, and\nONR MURI Grant N00014-06-1-0734.\n\n8\n\n\fReferences\n[1] A. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS,\n\n2001.\n\n[2] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector ma-\n\nchines. 
In AAAI, 2005.\n\n[3] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS,\n\n2004.\n\n[4] John S. Bridle, Anthony J. R. Heading, and David J. C. MacKay. Unsupervised classi\ufb01ers,\nmutual information and \u2018phantom targets\u2019. In John E. Moody, Steve J. Hanson, and Richard P.\nLippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages\n1096\u20131101. Morgan Kaufmann Publishers, Inc., 1992.\n\n[5] Olivier Chapelle and Alexander Zien. Semi-supervised classi\ufb01cation by low density separa-\n\ntion, September 2004.\n\n[6] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization.\n\nMathematical Programming, 45:503\u2013528, 1989.\n\n[7] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In NIPS, 1999.\n[8] Y. W. Teh. A hierarchical bayesian language model based on pitman-yor processes. In ACL,\n\n2006.\n\n[9] K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. In ICML,\n\n2007.\n\n[10] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge\n\nUniversity Press, New York, NY, USA, 2004.\n\n[11] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classi\ufb01cation, 2:193\u2013\n\n218, 1985.\n\n[12] Alexander Strehl and Joydeep Ghosh. Cluster ensembles \u2014 A knowledge reuse framework for\n\ncombining multiple partitions. Journal of Machine Learning Research, 3:583\u2013617, 2002.\n\n[13] Y. Chen, J. Ze Wang, and R. Krovetz. CLUE: cluster-based retrieval of images by unsupervised\n\nlearning. IEEE Trans. Image Processing, 14(8):1187\u20131201, 2005.\n\n[14] G. Grif\ufb01n, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report\n\n7694, California Institute of Technology, 2007.\n\n[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for\n\nrecognizing natural scene categories. In CVPR, 2006.\n\n[16] P. D. 
Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without\n\nalignments. J. Mol. Biol., 330:771\u2013783, Jul 2003.\n\n[17] Nikil Wale and George Karypis. Comparison of descriptor spaces for chemical compound\n\nretrieval and classi\ufb01cation. In ICDM, pages 678\u2013689, 2006.\n\n[18] N. Shervashidze and K. M. Borgwardt. Fast subtree kernels on graphs. In NIPS, 2010.\n[19] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. CoRR,\n\nphysics/0004057, 2000.\n\n[20] N. Slonim, G. S. Atwal, G. Tkacik, and W. Bialek. Information-based clustering. Proc Natl\n\nAcad Sci U S A, 102(51):18297\u201318302, December 2005.\n\n[21] Francis Bach and Za\u00a8\u0131d Harchaoui. DIFFRAC: a discriminative and \ufb02exible framework for\nclustering. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, NIPS.\nMIT Press, 2007.\n\n[22] Le Song, Alex Smola, Arthur Gretton, and Karsten M. Borgwardt. A dependence maximization\nview of clustering. In ICML \u201907: Proceedings of the 24th international conference on Machine\nlearning, pages 815\u2013822, New York, NY, USA, 2007. ACM.\n\n[23] A. Corduneanu and T. Jaakkola. On information regularization. In UAI, 2003.\n\n9\n\n\f", "award": [], "sourceid": 457, "authors": [{"given_name": "Andreas", "family_name": "Krause", "institution": null}, {"given_name": "Pietro", "family_name": "Perona", "institution": null}, {"given_name": "Ryan", "family_name": "Gomes", "institution": null}]}