{"title": "Hierarchical Fisher Kernels for Longitudinal Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1961, "page_last": 1968, "abstract": "We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and sampling times. We therefore can avoid the ad hoc replacement of Fisher information matrix with the identity matrix commonly used in literature, which destroys the geometrical grounding of the kernel construction. In contrast, our construction retains the proper geometric structure resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel out-perform those based on generative models and other feature extraction routines.", "full_text": "Hierarchical Fisher Kernels for Longitudinal Data\n\nZhengdong Lu\n\nTodd K. Leen\n\nDept. of Computer Science & Engineering\n\nOregon Health & Science University\n\nBeaverton, OR 97006\n\nluz@cs.utexas.edu,tleen@csee.ogi.edu\n\nJeffrey Kaye\n\nLayton Aging & Alzheimer\u2019s Disease Center\n\nOregon Health & Science University\n\nPortland, OR 97201\nkaye@ohsu.edu\n\nAbstract\n\nWe develop new techniques for time series classi\ufb01cation based on hierarchical Bayesian\ngenerative models (called mixed-effect models) and the Fisher kernel derived from them.\nA key advantage of the new formulation is that one can compute the Fisher informa-\ntion matrix despite varying sequence lengths and varying sampling intervals. This avoids\nthe commonly-used ad hoc replacement of the Fisher information matrix with the iden-\ntity which destroys the geometric invariance of the kernel. 
Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel out-perform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.\n\n1 Introduction\nTime series classification arises in diverse applications. This paper develops new techniques based on hierarchical Bayesian generative models and the Fisher kernel derived from them. A key advantage of the new formulation is that, despite varying sequence lengths and sampling times, one can compute the Fisher information matrix. This avoids its common ad hoc replacement with the identity matrix. The latter strategy, common in the biological sequence literature [4], destroys the geometrical invariance of the kernel. Our construction retains the proper geometric structure, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space.\nThis work was motivated by the need to classify clinical longitudinal data on human motor and psychometric test performance. Clinical studies show that, at the population level, progressive slowing of walking and of the rate at which a subject can tap their fingers are predictive of cognitive decline years before its manifestation [1]. Similarly, performance on psychometric tests such as delayed recall of a story or word lists (tests not used in diagnosis) is predictive of cognitive decline [8]. An early predictor of cognitive decline for individual patients based on such longitudinal data would improve medical care and planning for assistance.\n\nOur new Fisher kernels use mixed-effects models [6] as the generative process.
These are hierarchical\nmodels that describe the population (consisting of many individuals) as a whole, and variations between\nindividuals in the population. The population model parameters (called \ufb01xed effects), the covariance of\nthe between-individual variability (the random effects), and the additive noise variance are \ufb01t by maximum\nlikelihood. The overall population model together with the covariance of the random effects comprise a set\nof parameters for the prior on an individual subject model, so the \ufb01tting scheme is a hierarchical empirical\nBayesian procedure.\n\nData Description The data for this study was drawn from the Oregon Brain Aging Study (OBAS) [2], a\nlongitudinal study spanning up to \ufb01fteen years with roughly yearly assessment of subjects. For our work,\nwe grouped the subjects into two classes: those who remain cognitively healthy through the course of the\nstudy (denoted normal), and those who progress to mild cognitive impairment (MCI) or further to dementia\n(denoted impaired). Since we are interested in prediction, we retain only data taken prior to diagnosis of\nimpairment. We use 97 subjects from the normal group and 46 from the group that becomes impaired.\nMotor task data included the time (denoted as seconds) and the number of steps (denoted as steps) to walk\n9 meters, and the number of times the subject can tap their fore\ufb01nger, both dominant (tappingD) and non-\ndominant hands (tappingN) in 10 seconds. Psychometric test data include delayed-recall, which measures\nthe number of words from a list of 10 that the subject can recall one minute after hearing the list, and logical\nmemory II in which the subject is graded on recall of a story told 15-20 minutes earlier.\n\n2 Mixed-effect Models\n2.1 Mixed-effect Regression Models\nIn this paper, we con\ufb01ne attention to parametric regression. Suppose there are k individuals (indexed by i =\n1, . . . 
, k) contributing data to the sample, and we have observations {t^i_n, y^i_n}, n = 1, . . . , N^i, as a function of time t for individual i. The data are modeled as y^i_n = f(t^i_n; \u03b3^i) + \u03b5^i_n, where \u03b3^i are the regression parameters and \u03b5^i_n is zero-mean white Gaussian noise with (unknown) variance \u03c3^2. The superscript on the model parameters \u03b3^i indicates that the regression parameters are different for each individual contributing to the population. Since the model parameters vary between individuals, it is natural to consider them generated by the sum of a fixed and a random piece: \u03b3^i = \u03b1 + \u03b2^i, where \u03b2^i (called the random effect) is assumed distributed N(0, D) with unknown covariance D. The expected parameter vector \u03b1, called the fixed effect, determines the model for the population as a whole, and the random effect \u03b2^i accounts for the differences between individuals. This intuition is most precise for the case in which the model is linear in the parameters\n\nf(t; \u03b3) = \u03b3^T \u03a6(t) = \u03b1^T \u03a6(t) + \u03b2^T \u03a6(t),   (1)\n\nwhere \u03a6(t) = [\u03c6_1(t), \u03c6_2(t), ..., \u03c6_d(t)]^T denotes a vector of basis functions.^1 We use M = {\u03b1, D, \u03c3} to denote the mixed-effect model parameters. The feature values, observation times, and observation noise are\n\ny^i \u2261 [y^i_1, ..., y^i_{N^i}]^T,   t^i \u2261 [t^i_1, ..., t^i_{N^i}]^T,   \u03b5^i \u2261 [\u03b5^i_1, ..., \u03b5^i_{N^i}]^T.\n\n^1 More generally, the fixed and random effects can be associated with different basis functions.\n\n2.2 Maximum Likelihood Fitting\nModel fitting uses the entire collection of data {t^i, y^i}, i = 1, . . . , k, to determine the parameters M = {\u03b1, D, \u03c3} by maximum likelihood. The likelihood of the data {t^i, y^i} given M is\n\np(y^i; t^i, M) = \u222b p(y^i | \u03b2^i; t^i, \u03c3) p(\u03b2^i | M) d\u03b2^i   (2)\n\n= (2\u03c0)^{-N^i/2} |\u03a3^i|^{-1/2} exp( -(1/2) (y^i - \u03a6(t^i)\u03b1)^T (\u03a3^i)^{-1} (y^i - \u03a6(t^i)\u03b1) ),   (3)\n\nwhere\n\n\u03a3^i = \u03a6(t^i) D \u03a6(t^i)^T + \u03c3^2 I   and   \u03a6(t^i) = [\u03a6(t^i_1), \u03a6(t^i_2), ..., \u03a6(t^i_{N^i})]^T.\n\nFigure 1: The fit mixed-effect models for two tests (panels: Seconds: Normal, Seconds: Impaired, logical memory II: Normal, logical memory II: Impaired). In each panel, the red line stands for the fixed effect \u03b1^T \u03a6(t). The two green lines stand for \u03b1^T \u03a6(t) \u00b1 sqrt(\u03a6^T(t) D \u03a6(t)), i.e., the population model \u00b1 the standard deviation of the deviation arising from the uncertainty of \u03b2. The black dashed line is the standard deviation when the observation noise is also included.\n\nThe data likelihood for Y = {y^1, y^2, ..., y^k} with T = {t^1, t^2, ..., t^k} is then p(Y; T, M) = \u220f_{i=1}^{k} p(y^i | t^i; M). The maximum likelihood values of {\u03b1, D, \u03c3} are found using the Expectation-Maximization algorithm [6] with {\u03b2^1, \u03b2^2, ..., \u03b2^k} considered as the latent variables:\n\nE-step:   Q(M, M^g) = E_{\u03b2^i}( log p(Y, {\u03b2^i}; T, M) | Y; T, M^g ),   (4)\n\nM-step:   M = arg max_M Q(M, M^g),   (5)\n\nwhere M^g stands for the model parameters estimated in the previous step, and the expectation in the E-step is with respect to the posterior distribution of {\u03b2^i} when Y is known and the model parameter is M^g. For the linear mixed-effect model in Equation (1), the M-step can be given in closed form.
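As a concrete check of the marginal likelihood in Equations (2)-(3), the sketch below evaluates log p(y^i; t^i, M) for one individual's series under the linear model with basis Phi(t) = [1, t]^T. This is an illustrative sketch only, not the authors' code; the function and variable names are our own.

```python
import numpy as np

def marginal_loglik(y, t, alpha, D, sigma):
    """Log of p(y^i; t^i, M) for the linear mixed-effect model with
    basis Phi(t) = [1, t]^T: a Gaussian with mean Phi(t^i) alpha and
    covariance Sigma^i = Phi(t^i) D Phi(t^i)^T + sigma^2 I."""
    Phi = np.column_stack([np.ones_like(t), t])          # N x d design matrix
    Sigma = Phi @ D @ Phi.T + sigma**2 * np.eye(len(t))  # marginal covariance
    r = y - Phi @ alpha                                  # residual about the fixed effect
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(t) * np.log(2 * np.pi) + logdet
                   + r @ np.linalg.solve(Sigma, r))
```

When D = 0 the random effect vanishes and the expression reduces to an ordinary i.i.d. Gaussian log-likelihood about the fixed-effect line, which makes a convenient sanity check.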
The details of the updating equations are given by Laird et al. [6].\nWe use the linear mixed-effect model with polynomial basis functions \u03a6(t) = [1, t]^T. We trained separate mixed-effect models for each of the six measurements. For the four motor behavior measurements, we use the logarithm of the data to reduce the skew of the residuals. Figure 1 shows the fit models for seconds and logical memory II as representatives of the six measurements. The plots show the fixed effect regression \u03b1^T \u03a6(t) (red curve), and the standard deviations arising from the random effects (green curves) and measurement noise (dashed black curve; see caption). The data are the blue spaghetti plots. The plots confirm that subjects who become impaired deteriorate faster than those who remain healthy.\nWith multiple classes (or component subpopulations), it is natural to use a mixture of mixed-effect models. We have two components: one fit on the normal group (denoted M0) and one fit on the impaired group (denoted M1), with Mm = {\u03b1m, Dm, \u03c3m}, m = 0, 1. Here, we use M\u0303 = {\u03c00, M0, \u03c01, M1} to denote the parameters of this mixture, with \u03c00 and \u03c01 being the mixing proportions (priors) estimated from the training data. The overall generative process for any individual (t^i, y^i) is summarized in Figure 2, where z^i \u2208 {0, 1} is the latent variable indicating which model component is used to generate y^i.\n\nFigure 2: The graphical model of the mixture of mixed-effect models.\n\n3 Hierarchical Fisher Kernel\n3.1 Fisher Kernel Background\nThe Fisher kernel [4] provides a way to extract discriminative features from the generative model.
For any \u03b8-parameterized model p(x; \u03b8), the Fisher kernel between x^i and x^j is defined as\n\nK(x^i, x^j) = (\u2207_\u03b8 log p(x^i; \u03b8))^T I^{-1} \u2207_\u03b8 log p(x^j; \u03b8),   (6)\n\nwhere I is the Fisher information matrix with (n, m) entry\n\nI_{n,m} = \u222b (\u2202 log p(x; \u03b8)/\u2202\u03b8_n)(\u2202 log p(x; \u03b8)/\u2202\u03b8_m) p(x; \u03b8) dx.   (7)\n\nThe kernel entry K(x^i, x^j) can be viewed as the inner product of the natural gradients I^{-1} \u2207_\u03b8 log p(x; \u03b8) at x^i and x^j with metric I, and is invariant to re-parameterization of \u03b8. Jaakkola and Haussler [4] prove that a linear classifier based on the Fisher kernel performs at least as well as the generative model.\n\n3.2 Retaining the Fisher Information Matrix\nIn the bioinformatics literature [3], and for longitudinal data such as ours, p(x^i; \u03b8) is different for each individual owing to different sequence lengths and (for longitudinal data) different sampling times t^i. The integral in Equation (7) must therefore include the distribution of sequence lengths and observation times. Where only sequence lengths differ, an empirical average can be used. However, where observation times are non-uniform and vary considerably between individuals (as is the case here), there is insufficient data to form an estimate by empirical averaging.\nThe usual response to the difficulty is to replace the Fisher information with the identity matrix [4]. This spoils the geometric structure, in particular the invariance of the kernel K(x^i, x^j) under change of coordinates in the model parameter space (model re-parameterization). This is a significant flaw: the coordinate system used to describe the model is immaterial and should not influence the value of K(x^i, x^j). For probabilistic kernel regression, the choice of metric is immaterial in the limit of large training sets [4].
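The invariance at stake can be seen in a one-dimensional toy case: for x ~ N(mu, sigma^2) with the mean reparameterized as mu = c * theta, the kernel of Equations (6)-(7) is unchanged by the choice of c, while dropping the Fisher information introduces a spurious factor of c^2. A hypothetical sketch (our own names, not from the paper):

```python
def fisher_kernel_1d(xi, xj, mu, sigma, c=1.0, use_info=True):
    """Fisher kernel for x ~ N(mu, sigma^2) with the mean written as
    mu = c * theta; c = 1 recovers the original coordinates.
    Illustrates Eqs. (6)-(7) and their reparameterization invariance."""
    score_i = c * (xi - mu) / sigma**2   # d/dtheta log p(xi; theta)
    score_j = c * (xj - mu) / sigma**2
    info = c**2 / sigma**2               # Fisher information in the theta coordinate
    return score_i * score_j / info if use_info else score_i * score_j
```

With `use_info=True` the c's cancel and the kernel equals (xi - mu)(xj - mu)/sigma^2 in every coordinate system; with the identity metric the value scales as c^2, i.e., it depends on an arbitrary modeling choice.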
However, for our application, which uses a support vector machine (SVM), we found the difference cannot be neglected. In our case, replacing the Fisher information matrix with the identity matrix is grossly unsuitable. For the mixed-effect model with polynomial basis functions, the Fisher score components associated with higher order terms (such as slope and curvature) are far larger than the entries associated with lower order terms (such as the intercept). Without the proper normalization provided by the Fisher information matrix, the kernel will be dominated by the higher order entries.^2 A principled extension of the Fisher kernel provided by our hierarchical model allows proper calculation of the Fisher information matrix.\n\n3.3 Hierarchical Fisher Kernel\nOur kernel design is based on the generative hierarchy of the mixture of mixed-effect models in Figure 2. We notice that the individual-specific information t^i enters this generative process only at the last step, while the \u201clatent\u201d variables \u03b3^i and z^i are drawn from the Gaussian mixture model (GMM) \u0398\u0303 = {\u03c00, \u03b10, D0, \u03c01, \u03b11, D1}, with p(z^i, \u03b3^i; \u0398\u0303) = \u03c0_{z^i} p(\u03b3^i; \u03b1_{z^i}, D_{z^i}).\nWe can thus build a standard Fisher kernel for the latent variables, and use it to induce a kernel on the observed data.
Denoting the latent variables by v^i, the Fisher kernel between v^i and v^j is\n\nK(v^i, v^j) = (\u2207_{\u0398\u0303} log p(v^i; \u0398\u0303))^T (I^v)^{-1} \u2207_{\u0398\u0303} log p(v^j; \u0398\u0303),\n\nwhere the Fisher score \u2207_{\u0398\u0303} log p(v^i; \u0398\u0303) is the column vector\n\n\u2207_{\u0398\u0303} log p(v^i; \u0398\u0303) = [\u2202 log p/\u2202\u03c00; \u2202 log p/\u2202\u03b10; \u2202 log p/\u2202D0; \u2202 log p/\u2202\u03c01; \u2202 log p/\u2202\u03b11; \u2202 log p/\u2202D1]^T,\n\nand I^v is the well-defined Fisher information matrix for v:\n\nI^v_{n,m} = \u222b (\u2202 log p(v; \u0398\u0303)/\u2202\u0398\u0303_n)(\u2202 log p(v; \u0398\u0303)/\u2202\u0398\u0303_m) p(v | \u0398\u0303) dv.   (8)\n\n^2 Our experiments on the OBAS data show that replacing the Fisher information with the identity compromises classifier performance.\n\nThe kernel for y^i and y^j is the expectation of K(v^i, v^j) given the observations y^i and y^j:\n\nK(y^i, y^j) = E_{v^i,v^j}[K(v^i, v^j) | y^i, y^j; t^i, t^j, M\u0303] = \u222b\u222b K(v^i, v^j) p(v^i | y^i; t^i, M\u0303) p(v^j | y^j; t^j, M\u0303) dv^i dv^j.\n\nWith different choices of the latent variable v, we have the three kernel design strategies in the following subsections. This extension of the Fisher kernel, named the hierarchical Fisher kernel (HFK), enables us to deal with time series with irregular sampling and different sequence lengths. To our knowledge it has not been reported elsewhere in the literature.\n\nDesign A: v^i = \u03b3^i\nThis kernel design marginalizes out the higher-level variable {z^i} and constructs a Fisher kernel between the {\u03b3^i}. The generative process is illustrated in Figure 3 (left panel), which is the same graphical model as in Figure 2 with the latent variable z^i marginalized out.^3
The Fisher kernel for \u03b3 is\n\nK(\u03b3^i, \u03b3^j) = (\u2207_{\u0398\u0303} log p(\u03b3^i | \u0398\u0303))^T (I^\u03b3)^{-1} \u2207_{\u0398\u0303} log p(\u03b3^j | \u0398\u0303).   (9)\n\nThe kernel between y^i and y^j is the expectation of K(\u03b3^i, \u03b3^j):\n\nK(y^i, y^j) = E_{\u03b3^i,\u03b3^j}(K(\u03b3^i, \u03b3^j) | y^i, y^j; t^i, t^j, M\u0303)   (10)\n\n= ( \u222b \u2207_{\u0398\u0303} log p(\u03b3^i | \u0398\u0303) p(\u03b3^i | y^i; t^i, M\u0303) d\u03b3^i )^T (I^\u03b3)^{-1} \u222b \u2207_{\u0398\u0303} log p(\u03b3^j | \u0398\u0303) p(\u03b3^j | y^j; t^j, M\u0303) d\u03b3^j.   (11)\n\nThe computational drawback is that the integrals \u222b \u2207_{\u0398\u0303} log p(\u03b3^j | \u0398\u0303) p(\u03b3^j | y^j; t^j, M\u0303) d\u03b3^j and I^\u03b3 do not have analytical solutions. In our experiments, we estimated them with Monte Carlo sampling.\n\nDesign B: v^i = (z^i, \u03b3^i)\nThis design strategy takes both \u03b3^i and z^i as the joint latent variable and builds a Fisher kernel for them. The generative process, summarized in Figure 3 (middle panel), gives the probability of the latent variables\n\np(z^i, \u03b3^i; \u0398\u0303) = \u03c0_{z^i} p(\u03b3^i; \u03b1_{z^i}, D_{z^i}).\n\nThe Fisher kernel for the joint variable (z^i, \u03b3^i) is\n\nK((z^i, \u03b3^i), (z^j, \u03b3^j)) = (\u2207_{\u0398\u0303} log p(z^i, \u03b3^i; \u0398\u0303))^T (I^{z,\u03b3})^{-1} \u2207_{\u0398\u0303} log p(z^j, \u03b3^j; \u0398\u0303),   (12)\n\nwhere I^{z,\u03b3} is the Fisher information matrix associated with the distribution p(z, \u03b3; \u0398\u0303). It can be shown that\n\nK((z^i, \u03b3^i), (z^j, \u03b3^j)) = (1/\u03c0_{z^i}) \u03b4(z^i, z^j) (1 + K_{z^i}(\u03b3^i, \u03b3^j)),\n\n^3 Strictly speaking, we cannot sum out z^i at this step since the group membership is used later in generating the observation noise.
However, this is a reasonable approximation since the noise variances of M0 and M1 are similar.\n\nwhere K_m(\u03b3^i, \u03b3^j) is the Fisher kernel for \u03b3 associated with component m (= 0, 1):\n\nK_m(\u03b3^i, \u03b3^j) = (\u2207_{\u0398m} log p(\u03b3^i; \u03b1m, Dm))^T I_m^{-1} \u2207_{\u0398m} log p(\u03b3^j; \u03b1m, Dm).   (13)\n\nThe kernel for y^i and y^j is defined similarly as in Design A:\n\nK(y^i, y^j) = E_{z^i,\u03b3^i,z^j,\u03b3^j}(K((z^i, \u03b3^i), (z^j, \u03b3^j)) | y^i, y^j; t^i, t^j, M\u0303),   (14)\n\nwhere the integral can be evaluated analytically.\n\nDesign C: M\u0303 = Mm, m = 0 or 1\nThis design uses one mixed-effect component instead of the mixture as the generative model, as illustrated in Figure 3 (right panel). Although a single Mm is not a satisfying generative model for the whole population, the resulting kernel is still useful for classification, as follows. For either model, m = 0, 1, the Fisher score for the ith individual, \u2207_{\u0398m} log p(\u03b3^i; \u0398m), describes how the probability p(\u03b3^i; \u0398m) responds to a change of the parameters \u0398m. This is a discriminative feature vector, since the likelihoods of \u03b3^i for individuals from different groups are likely to respond differently to a change of the parameters \u0398m. The kernel between \u03b3^i and \u03b3^j is K_m(\u03b3^i, \u03b3^j) defined in Equation (13), and the kernel for y^i and y^j is then\n\nK(y^i, y^j) = E_{\u03b3^i,\u03b3^j}(K(\u03b3^i, \u03b3^j) | y^i, y^j; t^i, t^j, Mm).   (15)\n\nFigure 3: The graphical models of the mixture of mixed-effect models for Design A (left), Design B (middle), and Design C (right).\n\nOur experiments show that the kernel based on the impaired group is significantly better than the others; we therefore use this kernel as the representative of Design C.
It is easy to see that this kernel is a special case of Design A or Design B with \u03c00 = 1 and \u03c01 = 0.\n\n3.4 Related Models\nMarginalized Kernel   Our HFK is related to the marginalized kernel (MK) proposed by Tsuda et al. [10]. MK uses a distribution with a discrete latent variable h (indicating the generating component) and an observable x, which form a complete data pair x\u0303 = (h, x). The kernel for observables x^i and x^j is defined as\n\nK(x^i, x^j) = \u2211_{h^i} \u2211_{h^j} P(h^i | x^i) P(h^j | x^j) K\u0303(x\u0303^i, x\u0303^j),   (16)\n\nwhere K\u0303(x\u0303^i, x\u0303^j) is the joint kernel for the complete data. Tsuda et al. [10] use the form\n\nK\u0303(x\u0303^i, x\u0303^j) = \u03b4(h^i, h^j) K_{h^i}(x^i, x^j),   (17)\n\nwhere K_{h^i}(x^i, x^j) is a pre-defined kernel for observables associated with the h^i-th generative component. Equation (17) says that K\u0303(x\u0303^i, x\u0303^j) takes the value of the kernel defined for the mth component model if x^i and x^j are generated from the same component, h^i = h^j = m; otherwise, K\u0303(x\u0303^i, x\u0303^j) = 0. HFK can be viewed as a special case of a generalized marginalized kernel that allows continuous latent variables h. This is clear if we re-write Equation (16) as K(x^i, x^j) = E_{h^i,h^j}(K\u0303(x\u0303^i, x\u0303^j) | x^i, x^j) and view K\u0303(x\u0303^i, x\u0303^j) as a generalization of a kernel between h^i and h^j. Nevertheless, HFK differs from the original work in [10] in that MK requires existing kernels for the observables, such as K_h(x^i, x^j) in Equation (17). In our problem setting, this kernel does not exist, owing to the different lengths of the time series.\n\nProbability Product Kernel   We can get a family of kernels by employing various designs of K(v^i, v^j).
The simplest example is to let K(v^i, v^j) = \u03b4(z^i, z^j), which immediately leads to\n\nK(y^i, y^j) = E_{v^i,v^j}(K(v^i, v^j) | y^i, y^j; t^i, t^j, M\u0303) = \u2211_m P(z^i = m | y^i; t^i, M\u0303) P(z^j = m | y^j; t^j, M\u0303),\n\nwhich is obviously related to the posterior probabilities of the samples, and is essentially a special case of the probability product kernels proposed by Jebara et al. [5].\n\n4 Experiments\nPerformance Evaluation   We use the empirical ROC curve (detection rate vs. false alarm rate) to evaluate classifiers. We compare different classifiers using the area under the curve (AUC), and calculate the statistical significance following the method given by Pepe [9]. We tested the classifiers on five features: steps, seconds, tappingD, tappingN, and logical memory II. The results for delayed-recall are omitted; they are very close to those for logical memory II. The mixed-effect models for each feature were trained separately with order-1 polynomials (linear) as the basis functions. For each feature, the kernels are used in support vector machines (SVMs) for classification, and the ROC is obtained by thresholding the classifier output with varying values. The classifiers are evaluated by leave-one-out cross-validation, the left-out sample consisting of an individual subject\u2019s complete time series (which is also held out of the fitting of the generative model).\nClassifiers   For comparison, we also examined the following two classifiers. First, we consider the likelihood ratio test based on the mixed-effect models {M0, M1}. For any given observation (t, y), the likelihood that it is generated by mixed-effect model Mm is given by p(y; t, Mm), defined as in Equation (3). The classification decision for a likelihood ratio classifier is made by thresholding the ratio p(y; t, M0) / p(y; t, M1).
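A minimal sketch of this likelihood-ratio classifier, assuming the linear basis Phi(t) = [1, t]^T; the helper names, parameter values, and zero threshold below are illustrative, not the authors' implementation.

```python
import numpy as np

def mixed_effect_loglik(y, t, alpha, D, sigma):
    # Marginal log-likelihood of Eq. (3) for basis Phi(t) = [1, t]^T.
    Phi = np.column_stack([np.ones_like(t), t])
    S = Phi @ D @ Phi.T + sigma**2 * np.eye(len(t))
    r = y - Phi @ alpha
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (len(t) * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(S, r))

def classify(y, t, M0, M1, threshold=0.0):
    """Declare 'normal' when log p(y; t, M0) - log p(y; t, M1) exceeds
    the threshold; sweeping the threshold traces out the ROC curve."""
    ratio = mixed_effect_loglik(y, t, *M0) - mixed_effect_loglik(y, t, *M1)
    return 'normal' if ratio > threshold else 'impaired'
```

Each of M0 and M1 is a fitted (alpha, D, sigma) triple; with equal covariance structures the decision reduces to comparing residuals about the two fixed-effect lines.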
Second, we consider a feature extraction routine independent of any generative model. We summarize each individual i with the least-squares fit coefficients of a d-degree polynomial regression model, denoted p^i. To get a reliable fit we only consider the case d = 1, since many individuals have only four or five observations. We use the coefficients (normalized to their standard deviations), denoted p\u0302^i, as the feature vector, and build an RBF kernel G_ij = exp(\u2212||p\u0302^i \u2212 p\u0302^j||^2 / (2s^2)), where s is the kernel width, estimated with leave-one-out cross-validation in our experiments. The resulting kernel matrix G will be referred to as the LSQ kernel.\nResults   We first compare the three HFK designs, using the ROC curves plotted in Figure 4 (upper row). On the four motor tests, Design A and Design B are very much comparable, except on tappingD, on which Design A is marginally better than Design B with p = 0.136. Also on the motor tests, Design C is slightly but consistently better than the other two designs. On logical memory II (story recall), the three designs have comparable performance. We thus use Design C as the representative of HFK, and compare it with the likelihood ratio classifier and the SVM based on the LSQ kernel, as shown in Figure 4 (lower row). On all four motor tests, the classifier based on HFK clearly out-performs the other two classifiers, and on logical memory II the three classifiers have very much comparable performance.\n\n5 Discussion\nFisher kernels derived from mixed-effect generative models retain the Fisher information matrix, and hence the proper invariance of the kernel under change of coordinates in the model parameter space. In additional experiments, classifiers constructed with the proper kernel out-perform those constructed with the identity matrix in place of the Fisher information on our data.
For example, on seconds, HFK (Design C) achieves AUC = 0.7333, while the Fisher kernel computed with the identity matrix as the metric on p(y^i; t^i, M) achieves AUC = 0.6873, with p-value (Z-test) 0.0435.\nOur classifiers built with Fisher kernels derived from mixed-effect models outperform those based solely on the generative model (using likelihood ratio tests) for the motor task data, and are comparable on the psychometric tests. The hierarchical kernels also produce better classifiers than a standard SVM using the coefficients of a least-squares fit to the individual\u2019s data. This shows that the generative model provides a real advantage for classification. The mixed-effect models capture both the population behavior (through \u03b1) and the statistical variability of the individual subject models (through the covariance of \u03b2). Knowledge of the statistics of the subject variability is extremely important for classification: although not discussed here, classifiers based only on the population model (\u03b1) perform far worse than those presented here [7].\n\nFigure 4: Comparison of classifiers on steps, seconds, tappingD, tappingN, and logical memory II (one column per feature). Each number is the p-value (Z-test) for the null hypothesis \u201cthe AUC of Classifier 1 is the same as the AUC of Classifier 2\u201d. Upper row, three HFK designs (p1: Design A vs. Design B; p2: Design C vs. Design A): p1 = 0.486, p2 = 0.326; p1 = 0.387, p2 = 0.158; p1 = 0.136, p2 = 0.210; p1 = 0.482, p2 = 0.286; p1 = 0.491, p2 = 0.452. Lower row, HFK vs. other classifiers (p1: Design C vs. likelihood ratio; p2: Design C vs. LSQ kernel): p1 = 0.041, p2 = 0.038; p1 = 0.056, p2 = 0.083; p1 = 0.042, p2 = 0.085; p1 = 0.38, p2 = 0.049; p1 = 0.485, p2 = 0.523.\n\nAcknowledgements\nThis work was supported by Intel Corp. under the OHSU BAIC award.
Milar Moore and Robin Guariglia of the Layton Aging & Alzheimer\u2019s Disease Center gave invaluable help with data from the Oregon Brain Aging Study. We thank Misha Pavel, Tamara Hayes, and Nichole Carlson for helpful discussions.\n\nReferences\n[1] R. Camicioli, D. Howieson, B. Oken, G. Sexton, and J. Kaye. Motor slowing precedes cognitive impairment in the oldest old. Neurology, 50:1496\u20131498, 1998.\n[2] M. Green, J. Kaye, and M. Ball. The Oregon brain aging study: Neuropathology accompanying healthy aging in the oldest old. Neurology, 54(1):105\u2013113, 2000.\n[3] T. Jaakkola, M. Diekhaus, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. 7th Intell. Sys. Mol. Biol., pages 149\u2013158, 1999.\n[4] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. Technical report, Dept. of Computer Science, Univ. of California, 1998.\n[5] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research, 5:819\u2013844, 2004.\n[6] N. Laird and J. Ware. Random-effects models for longitudinal data. Biometrics, 38(4):963\u2013974, 1982.\n[7] Z. Lu. Constrained Clustering and Cognitive Decline Detection. PhD thesis, OHSU, 2008.\n[8] S. Marquis, M. Moore, D. Howieson, G. Sexton, H. Payami, J. Kaye, and R. Camicioli. Independent predictors of cognitive decline in healthy elderly persons. Arch. Neurol., 59:601\u2013606, 2002.\n[9] M. Pepe. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, Oxford, 2003.\n[10] K. Tsuda, T. Kin, and K. Asai. Marginalized kernels for biological sequences.
Bioinformatics, 1(1):1\u20138, 2002.", "award": [], "sourceid": 372, "authors": [{"given_name": "Zhengdong", "family_name": "Lu", "institution": null}, {"given_name": "Jeffrey", "family_name": "Kaye", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}]}