{"title": "Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1449, "page_last": 1405, "abstract": null, "full_text": "Graph-Driven Features Extraction from\n\nMicroarray Data using Diffusion Kernels and\n\nKernel CCA\n\nJean-Philippe Vert\n\nEcole des Mines de Paris\n\nJean-Philippe.Vert@mines.org\n\nMinoru Kanehisa\n\nBioinformatics Center, Kyoto University\nkanehisa@kuicr.kyoto-u.ac.jp\n\nAbstract\n\nWe present an algorithm to extract features from high-dimensional gene\nexpression pro\ufb01les, based on the knowledge of a graph which links to-\ngether genes known to participate to successive reactions in metabolic\npathways. Motivated by the intuition that biologically relevant features\nare likely to exhibit smoothness with respect to the graph topology, the\nalgorithm involves encoding the graph and the set of expression pro-\n\ufb01les into kernel functions, and performing a generalized form of canoni-\ncal correlation analysis in the corresponding reproducible kernel Hilbert\nspaces.\nFunction prediction experiments for the genes of the yeast S. Cerevisiae\nvalidate this approach by showing a consistent increase in performance\nwhen a state-of-the-art classi\ufb01er uses the vector of features instead of the\noriginal expression pro\ufb01le to predict the functional class of a gene.\n\n1 Introduction\n\nMicroarray technology (DNA chips) is quickly becoming a major data provider in the post-\ngenomics era, enabling the monitoring of the quantity of messenger RNA present in a cell\nfor several thousands genes simultaneously. By submitting cells to various experimental\nconditions and comparing the expression pro\ufb01les of different genes, a better understand-\ning of the regulation mechanisms and functions of each gene is expected. 
As a matter of fact, early experiments confirmed that many genes with similar function yield similar expression patterns [4], and systematic use of state-of-the-art machine learning classification algorithms highlighted the possibility of gene function prediction from microarray data, at least for some functional categories [2].\n\nIndependently of microarray technology, decades of research in molecular biology have characterized the roles played by many genes as catalysts of chemical reactions in the cell. This information has now been integrated into databases such as KEGG [8], where series of successive chemical reactions arranged into pathways are represented, together with the genes catalyzing them. In particular one can extract from such a database a graph of genes, where two genes are linked whenever they catalyze two successive reactions.\n\nThe question motivating this report is whether the knowledge of this graph can help improve the performance of gene function prediction algorithms based on microarray data only. To this end we propose a graph-driven feature extraction process, based on the idea that expression patterns which correspond to actual biological events, such as the activation or inhibition of a particular pathway, are more likely to be shared by genes close to each other in the graph than non-relevant patterns. Our approach consists in translating this intuition into a regularized version of canonical correlation analysis between the genes mapped to two reproducing kernel Hilbert spaces, defined respectively by a diffusion kernel [9] on the graph and a linear kernel on the expression profiles. This formulation leads to a well-posed problem equivalent to a generalized eigenvector problem [1].\n\n2 Problem formulation\n\nThe set of genes is represented by a discrete set X of cardinality n. The set of expression profiles is a mapping e : X -> R^p, where p is the number of measurements and e(x) is the expression profile of gene x. In the sequel we assume that the set of profiles has been centered, i.e., Σ_{x∈X} e(x) = 0. The graph of genes extracted from the pathway database is represented by a simple graph Γ = (X, E), with the genes as vertices. Our goal is to use this graph to extract features from the expression profiles. To this end we formally define a feature to be a real-valued mapping on the set of genes f : X -> R, and we denote by F the set of possible features. The set of centered features is denoted by F_0 = {f ∈ F : Σ_{x∈X} f(x) = 0}. In particular linear features extracted from the expression profiles are defined, for any v ∈ R^p, by f_{e,v}(x) = v^T e(x), for any x ∈ X (here and often in the sequel we use matrix notations, where v is a column vector and v^T its transpose). We call F_L ⊂ F the set of linear features. The normalized variance of a linear feature is defined by:\n\nV(f_{e,v}) = Σ_{x∈X} (v^T e(x))^2 / (n ||v||^2).   (1)\n\nIt is a first indicator of the possible relevance of a linear feature. Indeed biological events such as the synthesis of new molecules usually require the coordinated actions of many proteins: they are therefore likely to have characteristic patterns in terms of gene expression which capture variation between the genes involved and the others, and should therefore have large variance. Linear features with a large normalized variance (1) are called relevant in the sequel, as opposed to irrelevant features. 
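To make the definition concrete, the normalized variance (1) and its maximization can be sketched in a few lines of NumPy (a hypothetical toy setup: the random matrix E stands in for the centered profiles e(x)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10                      # n genes, p measurements (toy sizes)
E = rng.standard_normal((n, p))
E -= E.mean(axis=0)                 # center the profiles: sum_x e(x) = 0

def normalized_variance(E, v):
    """Normalized variance (1) of the linear feature f_{e,v}(x) = v^T e(x)."""
    n = E.shape[0]
    return np.sum((E @ v) ** 2) / (n * np.dot(v, v))

# The direction maximizing (1) is the leading eigenvector of E^T E, i.e. the
# first principal direction, so the most "relevant" linear features are the
# top principal components of the profiles.
w, V = np.linalg.eigh(E.T @ E)      # eigenvalues in ascending order
v_top = V[:, -1]
assert np.isclose(normalized_variance(E, v_top), w[-1] / n)
assert normalized_variance(E, v_top) >= normalized_variance(E, rng.standard_normal(p))
```

The last assertion is the Rayleigh-quotient property behind the observation that relevant features can be extracted by PCA.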
Relevant features can be extracted by PCA.\n\nWhile the normalized variance (1) is an intrinsic property of the set of profiles, the knowledge of the graph Γ suggests another criterion to judge \"good\" features. As genes linked together in the graph are supposed to participate in successive reactions in the cell, it is likely that the activation/inhibition of a biochemical pathway has a characteristic expression pattern shared by clusters of genes in the graph. More globally, the graph defines a structure on the set of genes, and therefore a notion of smoothness for any feature f ∈ F: a feature is called smooth if it varies slowly between adjacent nodes in the graph, and rugged otherwise. As just stated, features of interest are more likely to be smooth than other features.\n\nWe therefore end up with two criteria for extracting \"good\" features: they should simultaneously be relevant and smooth, the latter being defined with respect to the gene graph. One way to extract such features is to look for pairs of features (f_1, f_2) ∈ F × F_L, such that f_1 be smooth, f_2 be a relevant linear feature, and the correlation between f_1 and f_2 be as large as possible. The decoupling of the two criteria enables us to state the problem mathematically as follows.\n\nSuppose we can define a smoothness functional h_1 : F -> R for any feature, and a relevance functional h_2 : F_L -> R for linear features, in such a way that lower values of h_1 (resp. h_2) correspond to smoother (resp. more relevant) features. Then the following optimization problem:\n\nmax_{(f_1, f_2) ∈ F × F_L} f_1^T f_2 / [ (f_1^T f_1 + δ h_1(f_1))^{1/2} (f_2^T f_2 + δ h_2(f_2))^{1/2} ],   (2)\n\nwhere δ ≥ 0 is a regularization parameter, is a way to extract smooth and relevant features. Irrelevance and ruggedness penalize any candidate pair through the functionals h_1 and h_2, and δ controls the trade-off between relevance and smoothness on the one hand, and correlation on the other hand. Setting δ = 0 amounts to finding f_1 and f_2 as correlated as possible (which is obtained by taking f_1 = f_2), while δ > 0 forces f_1 to be smooth and f_2 to be relevant.\n\nIn order to turn (2) into an algorithm we remark that if h_1 and h_2 can be expressed as norms in reproducing kernel Hilbert spaces (RKHS, see Section 3), then (2) takes the form of a generalization of canonical correlation analysis (CCA) known as kernel-CCA [1], which is equivalent to a generalized eigenvector problem. Let us therefore show how to build two RKHS on the set of genes whose norms are smoothness (Section 4) and relevance (Section 5) functionals, respectively.\n\n3 Reproducing kernel Hilbert spaces and smoothness functionals\n\nLet us briefly review basic properties of RKHS relevant for the sequel. The reader is referred to [12, 14] for more details. Let X be a finite set and K : X × X -> R be a Mercer kernel, in the sense that the n × n matrix K = (K(x, y))_{x,y∈X} is symmetric positive semidefinite. Let H be the linear span of {K(x, ·), x ∈ X}, and consider a decomposition of K as:\n\nK = Σ_{i=1}^n λ_i u_i u_i^T,   (3)\n\nwhere λ_1 ≥ ... ≥ λ_n ≥ 0 are the eigenvalues of K 
and (u_1, ..., u_n) an associated orthonormal basis of eigenvectors in L^2(X). The decomposition of any f ∈ H on this basis can be expressed as f = Σ_{i=1}^n a_i u_i, where a_i = 0 whenever λ_i = 0. An inner product can be defined in H as follows:\n\n<f, g>_K = Σ_{i : λ_i > 0} a_i b_i / λ_i,   (4)\n\nfor f = Σ_i a_i u_i and g = Σ_i b_i u_i. The resulting Hilbert space H is called a reproducing kernel Hilbert space, due to the following reproducing property:\n\n∀f ∈ H, ∀x ∈ X, <f, K(x, ·)>_K = f(x).   (5)\n\nThe inner product in H can be easily expressed in a dual form as follows. Each f ∈ H can be decomposed as f(x) = Σ_{y∈X} α_y K(x, y), where α ∈ R^n is unique up to the addition of an element of the null space of K and is called the dual coordinate of f. In matrix form, this reads f = Kα, and using (5) one can easily check that the inner product between two features (f_1, f_2) ∈ H^2 with dual coordinates (α, β) respectively is given by:\n\n<f_1, f_2>_K = α^T K β.   (6)\n\nIn particular the H-norm of a feature f ∈ H with dual coordinate α ∈ R^n is given by:\n\n||f||_K^2 = α^T K α,   (7)\n\nand the inner product in the original space L^2(X) between two features (f_1, f_2) ∈ H^2 with dual coordinates (α, β) can also be expressed in dual form:\n\nf_1^T f_2 = α^T K^2 β.   (8)\n\nWhen X is a subspace of R^d, it is known that the norms in the RKHS defined by several popular kernels, such as the Gaussian radial basis kernel, are smoothing functionals, in the sense that larger values of ||f||_K correspond to functions f with more energy at high frequency in their Fourier decomposition. This fact has been much exploited e.g. in regularization theory [14, 5], and we now adapt it to the discrete setting.\n\n4 Smoothness functional on a graph\n\nA natural way to quantify the smoothness of a feature on a graph is by its energy at high frequency, as computed from its Fourier transform. The Fourier transform on graphs is a classical tool of spectral graph analysis [3, 11], which we briefly recall now. Let A be the n × n adjacency matrix of the graph Γ (A_{x,y} = 1 if there is an edge between x and y, 0 otherwise) and D the diagonal matrix of vertex degrees. Then the n × n matrix L = D - A is called the Laplacian of Γ, and is known to share many properties with the continuous Laplacian [11]. It is symmetric, positive semidefinite, and singular. The constant eigenvector (1, ..., 1)^T belongs to the eigenvalue 0, whose multiplicity is equal to the number of connected components of Γ. Let us denote by 0 = λ_1 ≤ ... ≤ λ_n the eigenvalues of L and by (φ_1, ..., φ_n) an orthonormal set of associated eigenvectors. This basis is a discrete Fourier basis [3], and it is known that φ_i oscillates more and more on the graph as i increases. The Fourier decomposition of any feature f ∈ F is the expansion in terms of this basis:\n\nf = Σ_{i=1}^n f̂_i φ_i,   (9)\n\nwhere\n\nf̂_i = φ_i^T f, i = 1, ..., n,   (10)\n\nand f̂ = (f̂_1, ..., f̂_n) is called the discrete Fourier transform of f. For any monotonic decreasing mapping r : R^+ -> R^{+*}, let us now consider the function K_r : X × X -> R defined by:\n\nK_r(x, y) = Σ_{i=1}^n r(λ_i) φ_i(x) φ_i(y).\n\nThe mapping r being assumed to take only positive values, the matrix K_r is definite positive and is therefore a Mercer kernel on the set X. The corresponding RKHS is the whole set of features F, with norm given by:\n\n||f||_{K_r}^2 = Σ_{i=1}^n f̂_i^2 / r(λ_i).   (11)\n\nAs i increases, λ_i increases, so r(λ_i) decreases. As a result the norm (11) has a higher value on features which have a lot of energy at high frequency, and is therefore a natural smoothing functional. An example of valid r function with rapid decay is the exponential r(λ) = exp(-τλ), where τ > 0 is a parameter. In that case we recover the diffusion kernel introduced and discussed in [9]. Considering other mappings r would be beyond the scope of this report, so we restrict ourselves to this diffusion kernel in the sequel. 
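A minimal numerical sketch of the constructions above (Laplacian, diffusion kernel, and smoothness norm (11)), on a hypothetical 5-node path graph; only NumPy and SciPy are assumed:

```python
import numpy as np
from scipy.linalg import expm

# Toy graph: a path over 5 "genes" (hypothetical adjacency, for illustration).
n = 5
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = D - A

tau = 1.0
lam, Phi = np.linalg.eigh(L)            # 0 = lam_1 <= ... <= lam_n, Fourier basis
# Spectral definition K_r(x, y) = sum_i r(lam_i) phi_i(x) phi_i(y) with
# r(l) = exp(-tau * l), which coincides with the matrix exponential exp(-tau L):
K1 = Phi @ np.diag(np.exp(-tau * lam)) @ Phi.T
assert np.allclose(K1, expm(-tau * L))

def smoothness_norm_sq(f):
    """RKHS norm (11): sum_i fhat_i^2 / r(lam_i)."""
    fhat = Phi.T @ f                    # discrete Fourier transform (10)
    return np.sum(fhat ** 2 * np.exp(tau * lam))

smooth = np.linspace(0.0, 1.0, n)       # varies slowly along the path
rugged = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
assert smoothness_norm_sq(rugged) > smoothness_norm_sq(smooth)

# Dual expression (7) of the same norm: ||f||_K^2 = alpha^T K alpha, f = K1 alpha.
alpha = np.linalg.solve(K1, smooth)
assert np.isclose(alpha @ K1 @ alpha, smoothness_norm_sq(smooth))
```

The two final assertions check, on this toy graph, that the norm penalizes rugged features and that it agrees with its dual-coordinate form.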
Observe that it can be expressed using the matrix exponential as K_1 = exp(-τL).\n\n5 Relevance functional\n\nIf v = Σ_{x∈X} β_x e(x) for some β ∈ R^n, then f_{e,v}(y) = v^T e(y) = Σ_{x∈X} β_x e(x)^T e(y) for any y ∈ X. As any v ∈ R^p has a projection onto the linear span of {e(x), x ∈ X}, and f_{e,v} only depends on this projection, the set of linear features F_L can be parametrized by directions v of this form. In other words f_{e,v} = K_2 β, where K_2 is the Gram matrix of the linear kernel K_2(x, y) = e(x)^T e(y); β is called the dual coordinate of v and is defined up to the addition of an element of the null space of K_2. Hence F_L is exactly the RKHS associated with this positive semidefinite matrix, and the norm of a linear feature is ||f_{e,v}||_{K_2} = ||v||, where v is taken in the span of the profiles. The variance of a feature f = K_2 β ∈ F_L can then be expressed by (1), (6) and (8) as follows:\n\nV(f) = f^T f / (n ||v||^2) = β^T K_2^2 β / (n β^T K_2 β).\n\nAs a result, a natural relevance functional to balance the term f^T f in (2) is the norm in this RKHS: h_2(f) = ||f||_{K_2}^2 = β^T K_2 β.\n\n6 Extracting smooth correlations\n\nLet K_1(x, y) denote the diffusion kernel and K_2(x, y) = e(x)^T e(y) the linear kernel, with associated RKHS H_1 and H_2 respectively. 
Taking h_1(f) = ||f||_{K_1}^2 as a smoothness functional for any f ∈ F, and h_2(f) = ||f||_{K_2}^2 as a relevance functional for any linear feature f ∈ F_L, we can express the maximization problem (2) in a dual form as:\n\nmax_{(α, β) ∈ R^n × R^n} α^T K_1 K_2 β / [ (α^T K_1^2 α + δ α^T K_1 α)^{1/2} (β^T K_2^2 β + δ β^T K_2 β)^{1/2} ],   (12)\n\nwhere α and β are the dual coordinates of f_1 = K_1 α and f_2 = K_2 β. At first sight it seems that (12) is the dual formulation of an optimization over (f_1, f_2) ∈ H_1 × H_2, and not F_0 × F_L as in (2). However it can be checked that any solution of (12) is in fact in F_0 × F_L. Indeed the numerator remains unchanged when a constant function is added to f_1, while both ||f_1|| and ||f_1||_{K_1} are minimized when f_1 has mean 0 (for the latter case, this results from the fact that the constant vector is an eigenvector of the diffusion kernel, so the norm defined by (4) is minimized when the corresponding projection of f_1, namely its average, is null).\n\nFormulated as (12) the problem appears to be a generalization of canonical correlation analysis (CCA) known as kernel-CCA, discussed in [1]. In particular Bach and Jordan show that (α, β) is a solution of (12) if and only if it satisfies the following generalized eigenvalue problem (in block notation, rows separated by semicolons):\n\n[ 0, K_1 K_2 ; K_2 K_1, 0 ] (α ; β) = ρ [ K_1^2 + δ K_1, 0 ; 0, K_2^2 + δ K_2 ] (α ; β),   (13)\n\nwith ρ the largest possible generalized eigenvalue. Moreover, solving (13) provides a series of pairs of features for which the gradient of (12) is null, with decreasing values of ρ; when δ is null, this is equivalent to the extraction of successive canonical directions with decreasing correlation in classical CCA. 
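The generalized eigenvalue problem can be solved directly with dense linear-algebra routines once the right-hand side is made positive definite, as in the regularized form (14); a self-contained toy sketch (random stand-ins for the gene graph and the expression profiles; SciPy assumed):

```python
import numpy as np
from scipy.linalg import expm, eigh

rng = np.random.default_rng(1)
n, p, tau, delta = 30, 8, 1.0, 0.1      # toy sizes and parameters (hypothetical)

# Random stand-ins for the real inputs: a sparse gene graph and centered profiles.
A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A
K1 = expm(-tau * L)                     # diffusion kernel on the graph
E = rng.standard_normal((n, p))
E -= E.mean(axis=0)
K2 = E @ E.T                            # linear kernel on the centered profiles

# Regularized kernel-CCA as a generalized eigenvalue problem (cf. (13)-(14)).
Z = np.zeros((n, n))
M = np.block([[Z, K1 @ K2], [K2 @ K1, Z]])
R1 = K1 + 0.5 * delta * np.eye(n)
R2 = K2 + 0.5 * delta * np.eye(n)
N = np.block([[R1 @ R1, Z], [Z, R2 @ R2]])
rho, W = eigh(M, N)                     # N is positive definite, so eigh applies
alpha, beta = W[:n, -1], W[n:, -1]      # pair attached to the largest rho
f1, f2 = K1 @ alpha, K2 @ beta          # smooth feature and correlated linear feature

assert np.allclose(rho, -rho[::-1], atol=1e-6)   # the spectrum is symmetric
assert rho[-1] > 0
```

Successive columns of W, ordered by decreasing ρ, give the series of extracted feature pairs.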
The resulting features f_1^(i) = K_1 α_i and f_2^(i) = K_2 β_i are therefore likely to have decreasing biological relevance as i increases, and are the features we propose to extract in this report.\n\nAs discussed in [1] we regularize the problem (13) by adding a term on the diagonal of the matrix on the right-hand side, to be able to perform the Cholesky decomposition necessary to solve this problem. Hence we end up with the following problem:\n\n[ 0, K_1 K_2 ; K_2 K_1, 0 ] (α ; β) = ρ [ (K_1 + (δ/2) I)^2, 0 ; 0, (K_2 + (δ/2) I)^2 ] (α ; β).   (14)\n\nIf (α, β) is a generalized eigenvector solution of (14) belonging to the generalized eigenvalue ρ, then (-α, β) belongs to -ρ. As a result the spectrum of (14) is symmetric: (ρ_1, -ρ_1, ..., ρ_n, -ρ_n) with ρ_1 ≥ ... ≥ ρ_n ≥ 0.\n\n7 Experiments\n\nWe extracted from the LIGAND database of chemical compounds and reactions in biological pathways [6] a graph made of 774 genes of the budding yeast S. cerevisiae, linked through 16,650 edges, where two genes are linked when they have the possibility to catalyze two successive reactions in the LIGAND database (i.e., two reactions such that the main product of the first one is the main substrate of the second one). Expression data were collected from the Stanford Microarray Database [13]. Concatenating several publicly available data sets, we ended up with 330 measurements for 6,075 genes of the yeast, i.e., almost all its known or predicted genes. Following [4, 2] we work with the normalized logarithm of the ratio of expression levels of the genes between two experimental conditions. The functional classes of the yeast genes we consider are the ones defined by the January 10, 2002 version of the Comprehensive Yeast Genome Database (CYGD) [10], which is a comprehensive classification of 3,936 genes into 259 categories.\n\nThe 669 genes in the gene graph with known expression profiles were first used to perform the feature extraction process described in this report. The resulting linear features were then evaluated on the expression profiles of the disjoint set of 2,688 genes which are in the CYGD functional catalogue but not in the pathway database. We then performed functional classification experiments on this set of 2,688 genes, using either the profiles themselves or the features extracted. All functional classes with more than 20 members in this set were tested (which amounts to 115 categories).\n\nExperiments were carried out with SVM Light [7], a public and free implementation of SVM. 
All vectors were scaled to unit length before being sent to the SVM, and all SVM use a radial basis kernel with unit width, i.e., K(x, y) = exp(-||x - y||^2 / 2). The trade-off parameter between training error and margin was set to its default value, and the costs of errors on positive and negative examples were adjusted to have the same total.\n\nPreliminary experiments were used to tune the two parameters of the algorithm, namely the width τ of the diffusion kernel and the regularization parameter δ, to values providing good performances. For these values we first tested whether there exists an optimal number of features to be extracted for optimal gene function prediction. Figure 1 shows the performance of SVM using different numbers of features, in terms of ROC index averaged over all 115 classes. The ROC index is the area under the ROC curve (true positives vs false positives), normalized to 100 for a perfect classifier and 50 for a random classifier. For each category the ROC index was averaged over several random splittings of the data into a training and a test set. It appears that the more features are included, the better the performance averaged over all categories. A more precise analysis of the different classes shows however that some classes don't follow the average trend and are better predicted by a smaller number of features, as shown on Figure 2 for five categories best predicted by a reduced number of features. Finally Figure 3 compares, for each of the 115 categories, the ROC index for a SVM using the original expression profiles with a SVM using the vectors of 330 features. It demonstrates that the representation of genes as vectors of features helps improve the performance of SVM (the ROC index averaged over all categories increases). The difference is especially important for classes such as heavy metal ion transporters, protein synthesis, ribosome biogenesis or morphogenesis.\n\n[Figure 1: ROC index averaged over 115 categories, for various numbers of features.]\n\n[Figure 2: ROC index for 5 functional categories (\"fermentation\", \"ionic_homeostasis\", \"protein_complexes\", \"vacuolar_transport\", \"nucleus_organization\"), for various numbers of features.]\n\n8 Discussion and Conclusion\n\nResults reported in the previous section are encouraging for at least two reasons. First of all, the performance reached for some classes such as heavy metal ion transporters shows that a ROC above 80% can be expected for several classes. Second, while many classes are apparently not learned by the SVM based on expression profiles (ROC around 50), the ROC based on extracted features of the same classes is around 60. 
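For reference, the ROC index used throughout this section reduces to a simple rank statistic (the Wilcoxon-Mann-Whitney form of the area under the ROC curve); a minimal sketch on a hypothetical toy score/label example:

```python
import numpy as np

def roc_index(labels, scores):
    """Area under the ROC curve as a percentage: 100 = perfect, 50 = random.

    Computed as the probability that a randomly drawn positive example is
    scored above a randomly drawn negative one (ties count one half).
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return 100.0 * wins / (len(pos) * len(neg))

assert roc_index([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]) == 100.0  # perfect ranking
assert roc_index([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]) == 75.0   # 3 of 4 pairs ordered
```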
This shows that there is hope to be able to predict more functional classes than previously thought [2] from microarray data, which is good news since the amount of microarray data is expected to explode in the coming years.\n\nThe method presented in this paper can be seen as an attempt to explore the possibilities of data mining and analysis provided by kernel methods. Few studies have used kernel methods other than SVM, or kernels other than Gaussian or polynomial kernels. In this report we tried to show how \"exotic\" kernels such as the diffusion kernel, and \"exotic\" methods such as kernel-CCA, can be adapted to particular problems, graph-driven feature extraction in our case. Exploring other possibilities of kernel methods in the data-rich field of computational biology is among our future plans.\n\n[Figure 3: ROC index of a SVM classifier based on expression profiles (y axis) or extracted features (x axis). Each point represents one functional category.]\n\nReferences\n\n[1] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.\n\n[2] Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terence S. Furey, Manuel Ares Jr., and David Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97:262–267, 2000.\n\n[3] Fan R. K. Chung. 
Spectral graph theory, volume 92 of CBMS Regional Conference Series. American Mathematical Society, Providence, 1997.\n\n[4] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95:14863–14868, Dec 1998.\n\n[5] Federico Girosi, Michael Jones, and Tomaso Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269, 1995.\n\n[6] S. Goto, Y. Okuno, M. Hattori, T. Nishioka, and M. Kanehisa. LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Research, 30:402–404, 2002.\n\n[7] Thorsten Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169–184. MIT Press, 1999.\n\n[8] M. Kanehisa, S. Goto, S. Kawashima, and A. Nakaya. The KEGG databases at GenomeNet. Nucleic Acids Research, 30:42–46, 2002.\n\n[9] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of ICML 2002, 2002.\n\n[10] H. W. Mewes, D. Frishman, U. Güldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Münsterkoetter, S. Rudd, and B. Weil. MIPS: a database for genomes and protein sequences. Nucleic Acids Research, 30(1):31–34, 2002.\n\n[11] B. Mohar. Some applications of Laplace eigenvalues of graphs. In G. Hahn and G. Sabidussi, editors, Graph Symmetry: Algebraic Methods and Applications, volume 497 of NATO ASI Series C, pages 227–275. Kluwer, Dordrecht, 1997.\n\n[12] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, UK, 1988.\n\n[13] G. Sherlock, T. Hernandez-Boussard, A. Kasarskis, G. Binkley, J. C. Matese, S. S. Dwight, M. Kaloper, S. Weng, H. Jin, C. A. Ball, M. B. Eisen, and P. T. Spellman. The Stanford Microarray Database. Nucleic Acids Research, 29(1):152–155, Jan 2001.\n\n[14] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.\n", "award": [], "sourceid": 2273, "authors": [{"given_name": "Jean-Philippe", "family_name": "Vert", "institution": null}, {"given_name": "Minoru", "family_name": "Kanehisa", "institution": null}]}