{"title": "Clustering with the Fisher Score", "book": "Advances in Neural Information Processing Systems", "page_first": 745, "page_last": 752, "abstract": null, "full_text": "Clustering with the Fisher Score

Koji Tsuda, Motoaki Kawanabe† and Klaus-Robert Müller†‡

AIST CBRC, 2-41-6, Aomi, Koto-ku, Tokyo, 135-0064, Japan
† Fraunhofer FIRST, Kekuléstr. 7, 12489 Berlin, Germany
‡ Dept. of CS, University of Potsdam, A.-Bebel-Str. 89, 14482 Potsdam, Germany
koji.tsuda@aist.go.jp, {nabe,klaus}@first.fhg.de

Abstract

Recently, the Fisher score (or the Fisher kernel) has increasingly been used as a feature extractor for classification problems. The Fisher score is the vector of parameter derivatives of the log-likelihood of a probabilistic model. This paper gives a theoretical analysis of how class information is preserved in the space of the Fisher score, which turns out to consist of a few important dimensions carrying class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means-type methods are clearly inappropriate, because they make use of all dimensions. We therefore develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit the important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

1 Introduction

Clustering is widely used in exploratory analysis for various kinds of data [6]. Among them, discrete data such as biological sequences [2] are especially challenging, because efficient clustering algorithms, e.g. K-Means [6], cannot be used directly. In such cases, one naturally considers mapping the data to a vector space and performing clustering there. We call this mapping a “feature extractor”. Recently, the Fisher score has been successfully applied as a feature extractor in supervised classification [5, 15, 14, 13, 16]. The Fisher score is derived as follows. Assume that a probabilistic model p(x|θ), θ ∈ R^d, is available. Given a parameter estimate θ̂ obtained from training samples, the Fisher score vector of a sample x is

f(x) = ( ∂ log p(x|θ̂)/∂θ_1, …, ∂ log p(x|θ̂)/∂θ_d )^T.

The Fisher kernel refers to the inner product in this space [5]. When combined with high-performance classifiers such as SVMs, the Fisher kernel often shows superb results [5, 14].

For successful clustering with the Fisher score, one has to investigate how the original classes are mapped into the feature space, and select a proper clustering algorithm. In this paper, it will be shown that the Fisher score has only a few dimensions which contain the class information, together with many unnecessary nuisance dimensions. K-Means-type clustering [6] is therefore inappropriate, because it takes all dimensions into account. We will propose a clustering method specialized for the Fisher score, which exploits the important dimensions carrying class information. The method has an efficient EM-like alternating procedure for learning, and has the favorable property that the resulting clusters are invariant to any invertible linear transformation of the feature space. Two experiments, with artificial data and with biological sequence data, illustrate the effectiveness of our approach.

2 Preservation of Cluster Structure

Before starting, let us fix notation. Denote by X the domain of objects (discrete or continuous) and by Y = {1, …, c} the set of class labels. A feature extractor is a map f: X → R^d.
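To make the definition concrete, the Fisher score can be written out analytically for a small model. The following sketch is our own illustration, not part of the paper: it computes the score of a one-dimensional two-component Gaussian mixture with respect to the mixing weight and the two means; all parameter and sample values are hypothetical.

```python
import numpy as np

def gaussian(x, mu, sigma=1.0):
    '''Density of N(mu, sigma^2) at x.'''
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fisher_score(x, theta, mu1, mu2):
    '''Fisher score of p(x|theta) = theta*N(mu1,1) + (1-theta)*N(mu2,1),
    i.e. the gradient of log p(x|theta) w.r.t. (theta, mu1, mu2).'''
    n1, n2 = gaussian(x, mu1), gaussian(x, mu2)
    p = theta * n1 + (1 - theta) * n2
    d_theta = (n1 - n2) / p                    # d/dtheta log p(x|theta)
    d_mu1 = theta * n1 * (x - mu1) / p         # d/dmu1 log p(x|theta)
    d_mu2 = (1 - theta) * n2 * (x - mu2) / p   # d/dmu2 log p(x|theta)
    return np.array([d_theta, d_mu1, d_mu2])

# Feature extraction: map each sample to its Fisher score vector.
x = np.array([-2.1, -1.9, 2.0, 2.2])
features = np.stack([fisher_score(xi, 0.5, -2.0, 2.0) for xi in x])
```

Samples near the first component get a positive mixing-weight coordinate, samples near the second a negative one; this is the kind of class-discriminating direction analyzed below.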
Let q(x, y) be the underlying joint distribution over X × Y, and assume that the class distributions q(x|y = i) are well separated, i.e. q(y = i|x) is close to 0 or 1 for almost every x.

First of all, let us assume that the marginal distribution q(x) is known. The problem is then how to find a good feature extractor which preserves the class information, based on this prior knowledge of q(x). For the Fisher score, this amounts to finding a good parametric model p(x|θ). The problem is by no means trivial, since it is in general impossible to infer the conditionals q(y|x) from the marginal q(x) without additional assumptions [12].

A loss function for feature extraction. In order to investigate how the cluster structure is preserved, we first have to define what the class information is. We regard the class information as completely preserved if a set of predictors in the feature space can recover the true posterior probabilities q(y = i|x), i = 1, …, c. This view makes sense, because it is impossible to recover the posteriors when the classes are totally mixed up. As the predictor of a posterior probability in the feature space, we adopt the simplest one, a linear estimator:

q̂(y = i|x) = w_i^T f(x) + b_i.   (2.1)

The prediction accuracy of (2.1) is difficult to formulate when the parameters w_i and b_i are learned from samples. To make a theoretical analysis possible, we consider the best possible linear predictor: the loss of a feature extractor f for the i-th class is defined as

R_i(f) = min_{w_i, b_i} E_q[ ( w_i^T f(x) + b_i − q(y = i|x) )^2 ],

where E_q denotes the expectation with respect to the true marginal q(x). The overall loss is the sum over all classes, R(f) = Σ_{i=1}^c R_i(f).

Even when the full class information is preserved, i.e. R(f) = 0, clustering in the feature space may not be easy, because of nuisance dimensions which do not contribute to clustering at all. The posterior predictors make use of an at most c-dimensional subspace of the d-dimensional Fisher score, and the complementary subspace may not contain any information about the classes. K-Means-type methods [6] assume a cluster to be hyperspherical, which means that every dimension should contribute to cluster discrimination. For such methods, we have to try to minimize the dimensionality d while keeping R(f) small. When nuisance dimensions cannot be excluded, we will need a different clustering method that is robust to nuisance dimensions; this issue is discussed in Sec. 3.

Optimal feature extraction. In the following, we discuss how to determine p(x|θ). First, a simple but unrealistic example is shown to achieve R(f) = 0 without producing nuisance dimensions at all. Let us assume that p(x|θ) is a mixture of the true class distributions:

p(x|θ) = Σ_{i=1}^{c−1} θ_i q(x|y = i) + ( 1 − Σ_{i=1}^{c−1} θ_i ) q(x|y = c),   (2.2)

where θ ∈ [0, 1]^{c−1}. Obviously this model realizes the true marginal distribution q(x) when θ_i = θ_i* := q(y = i), i = 1, …, c − 1. When the Fisher score is derived at this true parameter, it achieves R = 0:

Lemma 1. The Fisher score f(x) = ∇_θ log p(x|θ)|_{θ=θ*} of the model (2.2) achieves R(f) = 0.

(Proof) To prove the lemma, it is sufficient to show the existence of a (c − 1) × (c − 1) matrix A and a (c − 1)-dimensional vector b such that

( q(y = 1|x), …, q(y = c − 1|x) )^T = A f(x) + b.   (2.3)

(The c-th posterior is then affine in f(x) as well, since the posteriors sum to one, so all losses R_i(f) vanish.) The i-th component of the Fisher score of (2.2) is

f_i(x) = ( q(x|y = i) − q(x|y = c) ) / q(x) = q(y = i|x)/θ_i* − q(y = c|x)/θ_c*,

where θ_c* = 1 − Σ_{i=1}^{c−1} θ_i*. Write u(x) = ( q(y = 1|x), …, q(y = c − 1|x) )^T, T = diag(θ_1*, …, θ_{c−1}*), and 1 for the (c − 1)-dimensional vector of ones. Using q(y = c|x) = 1 − 1^T u(x), the score reads f(x) = ( T^{−1} + (1/θ_c*) 1 1^T ) u(x) − (1/θ_c*) 1. The matrix T^{−1} + (1/θ_c*) 1 1^T is invertible, so (2.3) holds when we set A = ( T^{−1} + (1/θ_c*) 1 1^T )^{−1} and b = (1/θ_c*) A 1.

Loose models and nuisance dimensions. We assumed that q(x) is known, but still we do not know the true class distributions q(x|y = i). Thus the model (2.2) of Lemma 1 is never available. In the following, the result of Lemma 1 is relaxed to a more general class of probability models by means of the chain rule of derivatives. However, in this case we have to pay a price: nuisance dimensions.

According to information geometry [1], a set of probability distributions can be regarded as a manifold in a Riemannian space. Denote by Q the manifold of mixtures of the true class distributions, i.e. all models of the form (2.2), and denote by M the manifold of the parametric model: M = { p(x|θ) : θ ∈ Θ }. The question now is how to determine a manifold M such that R(f) = 0, which is answered by the following theorem.

Theorem 1. Assume that the true marginal distribution is realized by the model, q(x) = p(x|θ*), where θ* is the true parameter. If the tangent space of M at q(x) contains the tangent space of Q at the same point (Fig. 1), then the Fisher score f derived from p(x|θ*) satisfies R(f) = 0.

(Proof) As in Lemma 1, it is sufficient to show the existence of a (c − 1) × d matrix A and a (c − 1)-dimensional vector b such that

( q(y = 1|x), …, q(y = c − 1|x) )^T = A f(x) + b.   (2.4)

Let g(x) = ∇_α log p_Q(x|α)|_{α=α*} denote the Fisher score of the mixture family Q at the common point q(x). When the tangent space of Q is contained in that of M, each component of g can be written as a linear combination of the components of the Fisher score f of the model, by the chain rule of derivatives:

g_i(x) = Σ_{k=1}^{d} s_{ik} f_k(x),  i = 1, …, c − 1,   (2.5)

with coefficients s_{ik} collected in a (c − 1) × d matrix S. By Lemma 1 there exist A′ and b′ with ( q(y = 1|x), …, q(y = c − 1|x) )^T = A′ g(x) + b′, so (2.4) holds by setting A = A′ S and b = b′.
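Lemma 1 can be verified numerically in the two-class case (our own check, with hypothetical class densities, not part of the paper): for c = 2 the model (2.2) has a single parameter, its Fisher score is s(x) = ( q(x|y=1) − q(x|y=2) ) / q(x), and the affine relation of the proof specializes to q(y=1|x) = θ*(1 − θ*) s(x) + θ*.

```python
import numpy as np

# Two well-separated class densities q(x|y=1), q(x|y=2) and prior theta = q(y=1).
theta = 0.3
def q1(x): return np.exp(-0.5 * (x + 3) ** 2) / np.sqrt(2 * np.pi)
def q2(x): return np.exp(-0.5 * (x - 3) ** 2) / np.sqrt(2 * np.pi)

x = np.linspace(-6, 6, 201)
q = theta * q1(x) + (1 - theta) * q2(x)      # true marginal q(x)

score = (q1(x) - q2(x)) / q                  # Fisher score of model (2.2) at theta*
posterior = theta * q1(x) / q                # true posterior q(y=1|x)

# Lemma 1 for c = 2: the posterior is an affine function of the score.
recovered = theta * (1 - theta) * score + theta
assert np.allclose(recovered, posterior)
```

The identity holds pointwise for any pair of class densities; only the coefficients θ*(1 − θ*) and θ* depend on the prior.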
Figure 1: Information-geometric picture of a probabilistic model whose Fisher score can fully extract the class information. When the tangent space of Q at q(x) is contained in that of M, the Fisher score can fully extract the class information, i.e. R(f) = 0. Details are explained in the text. (The figure sketches the two manifolds around the point q(x), with the important and nuisance directions marked.)

Figure 2: Feature space constructed by the Fisher score from samples with two distinct clusters. The x- and y-axes correspond to a nuisance and an important dimension, respectively. When the Euclidean metric is used, as in K-Means, it is difficult to recover the two “lines” as clusters.

In determining p(x|θ), we face the following dilemma: to capture the important dimensions (i.e. the tangent space of Q), the number of parameters d should be sufficiently larger than c. But a large d leads to many nuisance dimensions, which are harmful for clustering in the feature space. In typical supervised classification experiments with hidden Markov models [5, 15, 14], the number of parameters is much larger than the number of classes. In supervised scenarios, however, the existence of nuisance dimensions is not a serious problem, because advanced supervised classifiers such as the support vector machine have a built-in feature selector [7]. In unsupervised scenarios, without class labels, it is much more difficult to ignore nuisance dimensions. Fig. 2 shows what the feature space looks like when the number of clusters is two and only one nuisance dimension is involved. Projected onto the important dimension, the clusters concentrate at two distinct points. When the Euclidean distance is adopted, however, as in K-Means, it is difficult to recover the true clusters, because the two “lines” are close to each other.

3 Clustering Algorithm for the Fisher Score

In this section, we develop a new clustering algorithm for the Fisher score. Let {x_j}_{j=1}^n be a set of samples and {y_j}_{j=1}^n the class labels assigned to them, respectively. The purpose of clustering is to obtain {y_j}_{j=1}^n only from the samples {x_j}_{j=1}^n. As mentioned before, in clustering with the Fisher score it is necessary to capture the important dimensions. So far, this has been implemented by projection pursuit methods [3], which use general measures of interestingness, e.g. non-Gaussianity.
However, from the analysis of the last section, we know more than non-Gaussianity about the important dimensions of the Fisher score. Thus we will construct a method specially tuned for the Fisher score.

Let us assume that the underlying classes are well separated, i.e. q(y = i|x) is close to 0 or 1 for each sample x. When the class information is fully preserved, i.e. R(f) = 0, there are c directions in the space of the Fisher score such that the samples of the i-th cluster are projected close to 1 on the i-th direction and the others are projected close to 0. The objective function of our clustering algorithm is designed to detect such directions:

min_{y_1, …, y_n} min_{{w_i, b_i}_{i=1}^c} Σ_{i=1}^{c} Σ_{j=1}^{n} ( w_i^T f(x_j) + b_i − [y_j = i] )^2,   (3.1)

where [A] is the indicator function, which is 1 if the condition A holds and 0 otherwise. Notice that the optimal clustering obtained from (3.1) is invariant to any invertible linear transformation of f, since the weights w_i can absorb the transformation. In contrast, K-Means-type methods are quite sensitive to linear transformations and data normalization [6]. When the linear transformation is chosen unfavorably, K-Means can end up with a false result which does not reflect the underlying structure.¹

The objective function (3.1) can be minimized by the following EM-like alternating procedure:

1. Initialization: set {y_j}_{j=1}^n to initial values, and compute the Fisher score vectors f(x_j) for later use.

2. Repeat steps 3 and 4 until the convergence of {y_j}_{j=1}^n.

3. Fix {y_j}_{j=1}^n and minimize with respect to {w_i, b_i}_{i=1}^c. Each (w_i, b_i) is obtained as the solution of the least-squares problem min_{w_i, b_i} Σ_{j=1}^{n} ( w_i^T f(x_j) + b_i − [y_j = i] )^2, which is solved analytically via the normal equations, involving the inverse of the sample covariance matrix of the Fisher score vectors.

4. Fix {w_i, b_i}_{i=1}^c and minimize with respect to {y_j}_{j=1}^n: each y_j is obtained by solving y_j = argmin_{k ∈ {1, …, c}} Σ_{i=1}^{c} ( w_i^T f(x_j) + b_i − [k = i] )^2. The solution is obtained by exhaustive search over the c candidates.

The computational costs of steps 1, 3 and 4 are all linear in the number of samples n, so the algorithm can be applied to problems with large sample sizes. It additionally requires inverting a d × d matrix in step 3, which may only become an obstacle in an extremely high-dimensional data setting.

4 Clustering Artificial Data

We performed a clustering experiment with artificially generated data (Fig. 3). Since the data has a complicated structure, a Gaussian mixture with 8 components is used as the probabilistic model for the Fisher score: p(x|θ) = Σ_{k=1}^{8} α_k N(x|μ_k, C_k), where N(x|μ, C) denotes the Gaussian distribution with mean μ and covariance matrix C.
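The alternating procedure above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation; the small ridge term for numerical stability, the two-dimensional toy feature space imitating Fig. 2, and the noisy initial partition are our own assumptions.

```python
import numpy as np

def cluster_fisher(F, c, y0, n_iter=50, ridge=1e-6):
    '''Minimize (3.1): sum_i sum_j (w_i^T f(x_j) + b_i - [y_j = i])^2
    by alternating between the assignments y_j and the predictors (w_i, b_i).'''
    n, d = F.shape
    y = y0.copy()                            # step 1: initial assignments
    X = np.hstack([F, np.ones((n, 1))])      # absorb each b_i as an extra column
    for _ in range(n_iter):
        # Step 3: fix y, solve c least-squares problems in closed form.
        Y = np.eye(c)[y]                     # n x c indicator targets [y_j = i]
        A = X.T @ X + ridge * np.eye(d + 1)  # small ridge term (our addition)
        W = np.linalg.solve(A, X.T @ Y)      # columns hold (w_i, b_i)
        # Step 4: fix (w_i, b_i), reassign each sample; minimizing
        # sum_i (p_ji - [k = i])^2 over k amounts to picking argmax_i p_ji.
        P = X @ W
        y_new = P.argmax(axis=1)
        if np.array_equal(y_new, y):         # step 2: stop when y converges
            break
        y = y_new
    return y

# Toy feature space as in Fig. 2: one nuisance and one important dimension.
rng = np.random.default_rng(0)
truth = np.r_[np.zeros(50, int), np.ones(50, int)]
F = np.c_[3.0 * rng.standard_normal(100),              # nuisance dimension
          truth + 0.01 * rng.standard_normal(100)]     # important dimension
y0 = np.where(rng.random(100) < 0.2, 1 - truth, truth) # noisy initial partition
labels = cluster_fisher(F, c=2, y0=y0)
```

On this toy data the reassignment step effectively ignores the nuisance coordinate, because its regression weight is near zero; this is exactly the behavior the objective (3.1) is designed for.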
The parameters are learned with the EM algorithm, and the marginal distribution is accurately estimated, as shown in Fig. 3 (upper left). We applied the proposed algorithm and K-Means to the Fisher score calculated by taking derivatives with respect to the means μ_k. In order to obtain an initial partition, we first divided the points into 8 subclusters according to the posterior probabilities of the Gaussians. In both K-Means and our approach of Sec. 3, initial clusters were then constructed by randomly combining these subclusters. For each method, we chose the best result, i.e. the one which achieved the minimum loss among the local minima obtained from 100 clustering experiments. As a result, the proposed method obtained clearly separated clusters (Fig. 3, upper right), but K-Means failed to recover the “correct” clusters, which we consider an effect of the nuisance dimensions (Fig. 3, lower left). When the Fisher score is whitened (i.e. linearly transformed to have mean 0 and unit covariance matrix), the result of K-Means changed to Fig. 3 (lower right), but the solution of our method stayed the same, as discussed in Sec. 3.

¹When the covariance matrix of each cluster is allowed to be different in K-Means, it becomes invariant to normalization. However, this variant in turn suffers from singularities, where a cluster shrinks to a delta distribution, and it is difficult to learn in high-dimensional spaces.

Figure 3: (Upper left) Toy dataset used for clustering; contours show the estimated density with the mixture of 8 Gaussians. (Upper right) Clustering result of the proposed algorithm. (Lower left) Result of K-Means with the Fisher score. (Lower right) Result of K-Means with the whitened Fisher score.

Of course, this kind of problem can be solved by many state-of-the-art methods (e.g. [9, 8]), because it is only two-dimensional. However, these methods typically do not scale to high-dimensional or discrete problems. Standard mixture modeling methods also have difficulties in modeling such complicated cluster shapes [9, 10]. One straightforward way is to model each cluster as a Gaussian mixture itself. However, special care needs to be taken in such a “mixture of mixtures” problem: when all the parameters are jointly optimized by maximum likelihood, the solution is not unique. In order to obtain meaningful results, e.g. on our dataset, one has to constrain the parameters such that the 8 Gaussians form 2 groups. In the Bayesian framework, this can be done by specifying appropriate prior distributions on the parameters, which can become rather involved. Roberts et al. [10] tackled this problem by means of the minimum entropy principle using MCMC, which is somewhat more complicated than our approach.

5 Clustering Amino Acid Sequences

In this section, we apply our method to cluster bacterial gyrB amino acid sequences, where a hidden Markov model (HMM) is used to derive the Fisher score. gyrB (gyrase subunit B) is a DNA topoisomerase (type II) which plays essential roles in fundamental mechanisms of living organisms such as DNA replication, transcription, recombination and repair. One more important feature of gyrB is its capability of serving as an evolutionary and taxonomic marker, as an alternative to the popular 16S rRNA [17]. Our data set consists of 55 amino acid sequences containing three clusters of sizes 9, 32 and 14. The three clusters correspond to three genera of high-GC-content gram-positive bacteria: Corynebacteria, Mycobacteria and Rhodococcus, respectively. Each sequence is a string over the alphabet of 20 characters representing the amino acids. The lengths of the sequences vary from 408 to 442, which makes it difficult to convert a sequence into a vector of fixed dimensionality.

In order to evaluate the partitions, we use the Adjusted Rand Index (ARI) [4, 18]. Let C_1, …, C_K be the obtained clusters and D_1, …, D_L the ground-truth clusters. Let n_kl be the number of samples which belong to both C_k and D_l, and let a_k and b_l be the number of samples in C_k and D_l, respectively. With n the total number of samples and (m choose 2) denoting binomial coefficients, the ARI is defined as

ARI = [ Σ_{k,l} (n_kl choose 2) − t ] / [ (1/2) ( Σ_k (a_k choose 2) + Σ_l (b_l choose 2) ) − t ],  where t = Σ_k (a_k choose 2) Σ_l (b_l choose 2) / (n choose 2).

The attractive point of the ARI is that it can measure the difference of two partitions even when the numbers of clusters are different. When the two partitions are exactly the same, the ARI is 1, and the expected value of the ARI over random partitions is 0 (see [4] for details).

Figure 4: Adjusted Rand Indices of K-Means and the proposed method in the sequence classification experiment (ARI from 0 to 0.8 on the vertical axis, number of HMM states from 2 to 5 on the horizontal axis).

In order to derive the Fisher score, we trained fully connected HMMs with the Baum-Welch algorithm, where the number of states was changed from 2 to 5 and each state emits one of the 20 characters. Such an HMM has initial-state probabilities, terminal-state probabilities, transition probabilities and emission probabilities; when the number of states is 3, for example, the HMM has 75 parameters in total, which is much larger than the number of potential classes (i.e. 3). The derivative is taken with respect to all parameters, as described in detail in [15]. Notice that we did not perform any normalization of the Fisher score vectors. In order to avoid local minima, we tried 1000 different initial values and chose the one which achieved the minimum loss, both in K-Means and in our method.
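The ARI formula above can be computed directly from the contingency counts n_kl. The following implementation is our own sketch, not from the paper; standard libraries (e.g. scikit-learn's adjusted_rand_score) compute the same quantity.

```python
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    '''Adjusted Rand Index between two partitions given as label sequences.'''
    n = len(labels_a)
    # Contingency counts n_kl and the marginals a_k, b_l.
    pairs, a, b = {}, {}, {}
    for la, lb in zip(labels_a, labels_b):
        pairs[(la, lb)] = pairs.get((la, lb), 0) + 1
        a[la] = a.get(la, 0) + 1
        b[lb] = b.get(lb, 0) + 1
    sum_ab = sum(comb(m, 2) for m in pairs.values())
    sum_a = sum(comb(m, 2) for m in a.values())
    sum_b = sum(comb(m, 2) for m in b.values())
    t = sum_a * sum_b / comb(n, 2)     # chance term: expectation over random partitions
    max_index = (sum_a + sum_b) / 2
    return (sum_ab - t) / (max_index - t)

# Identical partitions give ARI = 1 even when the label names are permuted.
assert adjusted_rand_index([0, 0, 1, 1, 2], [1, 1, 0, 0, 2]) == 1.0
```

Because the index is built from pair counts only, it is insensitive to how the clusters are numbered, and partitions with different numbers of clusters can be compared directly.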
In K-Means, the initial centers were sampled from the uniform distribution over the smallest hypercube containing all samples. In the proposed method, every w_i was sampled from the normal distribution with mean 0 and standard deviation 0.001, and every b_i was initially set to zero.

Fig. 4 shows the ARIs of the two methods against the number of HMM states. Our method attains the highest ARI (0.754) when the number of HMM states is 3, which shows that the important dimensions were successfully discovered in the “sea” of nuisance dimensions. In contrast, the ARI of K-Means decreases monotonically as the number of HMM states increases, which shows that K-Means is not robust against nuisance dimensions. When the number of nuisance dimensions became too large (i.e. with 4 or 5 HMM states), however, our method was also caught in false clusters which happened to appear in the nuisance dimensions. This result suggests that prior dimensionality reduction may be effective (cf. [11]), but that is beyond the scope of this paper.

6 Concluding Remarks

In this paper, we illustrated how the class information is encoded in the Fisher score: most of the information is packed into a few dimensions, and there are many nuisance dimensions. Advanced supervised classifiers such as the support vector machine have a built-in feature selector [7], so they can detect the important dimensions automatically. In unsupervised learning, however, it is not easy to detect the important dimensions, because of the lack of class labels. We proposed a novel, very simple clustering algorithm that can ignore nuisance dimensions, and tested it in artificial and real data experiments. An interesting aspect of our gyrB experiment is that the ideal scenario assumed in the theory section is no longer fulfilled, as the clusters may overlap. Nevertheless, our algorithm is robust in this respect and achieves highly promising results.

The Fisher score derives features using prior knowledge of the marginal distribution. In general, it is impossible to infer anything about the conditional distribution q(y|x) from the marginal q(x) without further assumptions [12]. However, when one knows the directions in which the marginal distribution can move (i.e. a model of the marginal distribution), it is possible to extract information about q(y|x), even though it may be corrupted by many nuisance dimensions. Our method is straightforwardly applicable to the objects to which the Fisher kernel has been applied (e.g. speech signals [13] and documents [16]).

Acknowledgement. The authors gratefully acknowledge that the bacterial gyrB amino acid sequences were offered by courtesy of the Identification and Classification of Bacteria (ICB) database team [17]. KRM thanks the DFG for partial support under grant MU 987/1-1.

References

[1] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society, 2001.
[2] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[3] P.J. Huber. Projection pursuit. Annals of Statistics, 13:435–475, 1985.
[4] L. Hubert and P. Arabie. Comparing partitions. J. Classif., pages 193–218, 1985.
[5] T.S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 487–493. MIT Press, 1999.
[6] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[7] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Networks, 12(2):181–201, 2001.
[8] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[9] M. Rattray. A model-based distance for clustering. In Proc. IJCNN'00, 2000.
[10] S.J. Roberts, C. Holmes, and D. Denison. Minimum entropy data partitioning using reversible jump Markov chain Monte Carlo. IEEE Trans. Patt. Anal. Mach. Intell., 23(8):909–915, 2001.
[11] V. Roth, J. Laub, J.M. Buhmann, and K.-R. Müller. Going metric: Denoising pairwise data. In NIPS 02, 2003. To appear.
[12] M. Seeger. Learning with labeled and unlabeled data. Technical report, Institute for Adaptive and Neural Computation, University of Edinburgh, 2001. http://www.dai.ed.ac.uk/homes/seeger/papers/review.ps.gz.
[13] N. Smith and M. Gales. Speech recognition using SVMs. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[14] S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice site recognition. In ICANN'02, pages 329–336, 2002.
[15] K. Tsuda, M. Kawanabe, G. Rätsch, S. Sonnenburg, and K.-R. Müller. A new discriminative kernel from probabilistic models. Neural Computation, 14(10):2397–2414, 2002.
[16] A. Vinokourov and M. Girolami. A probabilistic framework for the hierarchic organization and classification of document collections. Journal of Intelligent Information Systems, 18(2/3):153–172, 2002.
[17] K. Watanabe, J.S. Nelson, S. Harayama, and H. Kasai. ICB database: the gyrB database for identification and classification of bacteria. Nucleic Acids Res., 29:344–345, 2001.
[18] K.Y. Yeung and W.L. Ruzzo. Principal component analysis for clustering gene expression data. Bioinformatics, 17(9):763–774, 2001.
", "award": [], "sourceid": 2292, "authors": [{"given_name": "Koji", "family_name": "Tsuda", "institution": null}, {"given_name": "Motoaki", "family_name": "Kawanabe", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}]}