{"title": "Regularized Modal Regression with Applications in Cognitive Impairment Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 1448, "page_last": 1458, "abstract": "Linear regression models have been successfully used to function estimation and model selection in high-dimensional data analysis. However, most existing methods are built on least squares with the mean square error (MSE) criterion, which are sensitive to outliers and their performance may be degraded for heavy-tailed noise. In this paper, we go beyond this criterion by investigating the regularized modal regression from a statistical learning viewpoint. A new regularized modal regression model is proposed for estimation and variable selection, which is robust to outliers, heavy-tailed noise, and skewed noise. On the theoretical side, we establish the approximation estimate for learning the conditional mode function, the sparsity analysis for variable selection, and the robustness characterization. On the application side, we applied our model to successfully improve the cognitive impairment prediction using the Alzheimer\u2019s Disease Neuroimaging Initiative (ADNI) cohort data.", "full_text": "Regularized Modal Regression with Applications in\n\nCognitive Impairment Prediction\n\nXiaoqian Wang1, Hong Chen1, Weidong Cai2, Dinggang Shen3, Heng Huang1\u2217\n1 Department of Electrical and Computer Engineering, University of Pittsburgh, USA\n\n2School of Information Technologies, University of Sydney, Australia\n\n3 Department of Radiology and BRIC, University of North Carolina at Chapel Hill, USA\n\nxqwang1991@gmail.com,chenh@mail.hzau.edu.cn\n\ntom.cai@sydney.edu.au,dinggang_shen@med.unc.edu,heng.huang@pitt.edu\n\nAbstract\n\nLinear regression models have been successfully used to function estimation and\nmodel selection in high-dimensional data analysis. 
However, most existing methods are built on least squares with the mean square error (MSE) criterion, which is sensitive to outliers and whose performance may degrade under heavy-tailed noise. In this paper, we go beyond this criterion by investigating regularized modal regression from a statistical learning viewpoint. A new regularized modal regression model is proposed for estimation and variable selection, which is robust to outliers, heavy-tailed noise, and skewed noise. On the theoretical side, we establish the approximation estimate for learning the conditional mode function, the sparsity analysis for variable selection, and the robustness characterization. On the application side, we apply our model to improve cognitive impairment prediction using the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort data.

1 Introduction

Modal regression [21, 5] has gained increasing attention recently due to its effectiveness in function estimation and its robustness to outliers and heavy-tailed noise. Unlike the traditional least-squares estimator, which pursues the conditional mean, modal regression aims to estimate the conditional mode of the output Y given the input X = x. It is well known that the conditional modes can reveal the structure of the outputs and the trends of observations, which are missed by the conditional mean [29, 4]. Thus, modal regression often achieves better performance than traditional least-squares regression in practical applications.

There have been several studies of modal regression with (semi-)parametric or nonparametric methods, such as [29, 28, 4, 6]. For parametric approaches, a parametric form is required for the global conditional mode function. Recent works [29, 28] belong to this category, where the method in [28] is based on a linear mode function assumption and the algorithm in [29] is associated with local polynomial regression.
For non-parametric approaches, the conditional mode is usually derived by maximizing a\nconditional density or a joint density. Typical work for this setting is established in [4], where a local\nmodal regression is proposed based on kernel density estimation and theoretical analysis is provided\nto characterize asymptotic error bounds.\nMost of the above mentioned works consider the asymptotic theory on the conditional mode function\nestimation. Recently, several studies on variable selection under modal regression were also con-\nducted in [30, 27]. These approaches addressed the problem from statistical theory viewpoint (e.g.,\n\n\u2217X. Wang and H. Chen made equal contributions to this paper. H. Huang is the corresponding author.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fasymptotic normality) and were implemented by modi\ufb01ed EM algorithm. Although these studies\nprovide us good understanding for modal regression, the following problems still remain unclear\nin theory and applications. Can we design new modal regression following the line of structural\nrisk minimization? Can we provide its statistical guarantees and computing algorithm for designed\nmodel? This paper focuses on answering the above questions.\nTo illustrate the effectiveness of our model, we looked into a practical problem, i.e., cognitive\nimpairment prediction via neuroimaging data. As the most common cause of dementia, Alzheimer\u2019s\nDisease (AD) imposes extensive and complex impact on human thinking and behavior. Accurate and\nautomatic study of the relationship between brain structural changes and cognitive impairment plays\na crucial role in early diagnosis of AD. 
In order to increase diagnostic capabilities, neuroimaging provides an effective approach for clinical detection and treatment-response monitoring of AD [13]. Several cognitive tests have been developed to assess an individual's cognitive level, such as the Mini-Mental State Examination (MMSE) [8] and the Trail Making Test (TMT) [1]. With the development of these techniques, a wide range of work has employed regression models to study the correlations between neuroimaging data and cognitive measures [23, 16, 26, 25, 24].

However, existing methods use mean regression models based on the least-squares estimator to predict the relationship between neuroimaging features and cognitive assessment, which may fail when the noise in the data is heavy-tailed or skewed. Given the complex data collection process [13], the assumption of symmetric noise may not hold for biomedical data. Under such circumstances, a modal regression model is more appropriate due to its robustness to outliers, heavy-tailed noise, and skewed noise. We applied our method to the ADNI cohort for the association study between neuroimaging features and cognitive assessment. Experimental results illustrate the effectiveness of our model. Moreover, with sparse constraints, our model found several imaging features that have been reported to be crucial to the onset and progression of AD.
The replication of these results further supports the validity of our model.

Our main contributions can be summarized as follows:

1) Following Tikhonov regularization and kernel density estimation, we develop a new Regularized Modal Regression (RMR) for estimating the conditional mode function and selecting informative variables, which can be considered a natural extension of the Lasso [22] and can be implemented efficiently by half-quadratic minimization methods.

2) Learning theory analysis is established for RMR from three aspects: approximation ability, sparsity, and robustness, which provides the theoretical foundation of the proposed approach.

3) By applying our RMR model to the ADNI cohort, we reveal interesting findings in cognitive impairment prediction for Alzheimer's disease.

2 Regularized Modal Regression

2.1 Modal regression

We consider a learning problem with input space X ⊂ R^p and output space Y ⊂ R. Let p_{Y|X=x} be the conditional density of Y ∈ Y given X = x ∈ X. In the prediction of cognitive assessment, we denote the neuroimaging data for the i-th sample as x_i and the cognitive measure for the i-th sample as y_i. Suppose that the training samples z = {(x_i, y_i)}_{i=1}^n ⊂ X × Y are generated independently by:

Y = f*(X) + ε,    (1)

where mode(ε|X = x) = argmax_t p_{ε|X}(t|X = x) = 0 for any x ∈ X. Here p_{ε|X}, the conditional density of ε given X, is well defined. Then the target function of modal regression can be written as:

f*(x) = mode(Y|X = x) = argmax_t p_{Y|X}(t|X = x), ∀x ∈ X.    (2)

To ensure that f* is well defined on X, we require the existence and uniqueness of the maximizer of p_{Y|X}(t|X = x) for any given x ∈ X. Relationship (2) means that f* is the maximizer of the conditional density p_{Y|X}, and it also maximizes the joint density p_{X,Y} [4, 29, 28].
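To make the contrast between conditional mean, median, and mode concrete, the following sketch (our illustration, not code from the paper) estimates the mode of a skewed Gaussian mixture, the same noise distribution used later in Section 4.1, by maximizing a kernel density estimate; `kde_mode` and the bandwidth value are our own choices.

```python
import numpy as np

def kde_mode(samples, bandwidth=0.3, grid_size=501):
    """Estimate the mode as the maximizer of a Gaussian kernel density estimate."""
    grid = np.linspace(samples.min(), samples.max(), grid_size)
    diffs = (samples[None, :] - grid[:, None]) / bandwidth
    density = np.exp(-0.5 * diffs**2).sum(axis=1)  # unnormalized KDE suffices for argmax
    return grid[np.argmax(density)]

rng = np.random.default_rng(0)
n = 10000
# Skewed two-component mixture: zero mean, median near 1, mode near 1.94 (cf. Section 4.1).
pick = rng.random(n) < 0.5
eps = np.where(pick, rng.normal(-2.0, 3.0, n), rng.normal(2.0, 1.0, n))
mean_e, median_e, mode_e = eps.mean(), np.median(eps), kde_mode(eps)
```

For such skewed noise the three location summaries separate clearly (mean near 0, median near 1, mode near 2), which is exactly why a mode-seeking estimator behaves differently from least squares.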
Here, we formulate modal regression following the dimension-insensitive statistical learning framework [7].

For feasibility, we denote by ρ the intrinsic distribution on X × Y for data generated by (1), and by ρ_X the corresponding marginal distribution on X. It has been proved in Theorem 3 of [6] that f* is the maximizer of

R(f) = ∫_X p_{Y|X}(f(x)|X = x) dρ_X(x)    (3)

over all measurable functions. Hence, we can adopt R(f) as the evaluation measure for a modal regression estimator f : X → R. However, we cannot obtain the estimator directly by maximizing this criterion, since p_{Y|X} and ρ_X are unknown. Recently, Theorem 5.1 in [6] showed that R(f) = p_{ε_f}(0), where p_{ε_f} is the density function of the random variable ε_f = Y − f(X). Then the problem of maximizing R(f) over some hypothesis space can be transformed into maximizing the density of ε_f at 0. This density p_{ε_f} can be estimated by nonparametric kernel density estimation.

For a kernel K_σ : R × R → R_+, we denote its representing function by φ((u − u')/σ) = K_σ(u, u'), which usually satisfies φ(u) = φ(−u), φ(u) ≤ φ(0) for any u ∈ R, and ∫_R φ(u) du = 1. Typical examples of kernels include the Gaussian kernel, Epanechnikov kernel, quadratic kernel, triweight kernel, and sigmoid function.
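As an illustration of these properties, the sketch below (ours, with assumed function names) defines the representing functions of the Gaussian and Epanechnikov kernels and checks symmetry, the maximum at zero, and the unit integral numerically.

```python
import numpy as np

def phi_gaussian(u):
    """Representing function of the Gaussian kernel (standard normal pdf)."""
    return np.exp(-0.5 * np.asarray(u, dtype=float)**2) / np.sqrt(2.0 * np.pi)

def phi_epanechnikov(u):
    """Representing function of the Epanechnikov (parabolic) kernel."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

u = np.linspace(-6.0, 6.0, 120001)
du = u[1] - u[0]
for phi in (phi_gaussian, phi_epanechnikov):
    assert np.allclose(phi(u), phi(-u))          # symmetry: phi(u) = phi(-u)
    assert np.all(phi(u) <= phi(0.0) + 1e-12)    # maximum attained at zero
    assert abs(phi(u).sum() * du - 1.0) < 1e-4   # integrates to one (Riemann sum)
```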
The empirical estimate of R(f) (and hence of p_{ε_f}(0)) can be obtained by kernel density estimation, and is defined as:

R_z^σ(f) = (1/(nσ)) Σ_{i=1}^n K_σ(y_i − f(x_i), 0) = (1/(nσ)) Σ_{i=1}^n φ((y_i − f(x_i))/σ).

Hence, an approximation of f* can be found by learning algorithms associated with R_z^σ(f). In theory, for any f : X → R, the expectation version of R_z^σ(f) is:

R^σ(f) = (1/σ) ∫_{X×Y} φ((y − f(x))/σ) dρ(x, y).

In particular, R(f) − R^σ(f) → 0 as σ → 0 [6].

2.2 Modal regression with coefficient-based regularization

In this paper, we assume that f*(x) = mode(Y|X = x) = w_*^T x for some w_* ∈ R^p. Following the ideas of ridge regression and the Lasso [22], we consider a robust linear estimator for learning the conditional mode function. Let F be the linear hypothesis space defined by:

F = {f(x) = w^T x : w = (w_1, ..., w_p) ∈ R^p, x ∈ X}.

For any given positive tuning parameters {τ_j}_{j=1}^p, we denote:

Ω(f) = inf { Σ_{j=1}^p τ_j |w_j|^q : f(x) = w^T x, q ∈ [1, 2] }.    (4)

Given the training set z, the regularized modal regression (RMR) estimator can be formulated as:

f_z = argmax_{f ∈ F} { R_z^σ(f) − λ Ω(f) },    (5)

where the regularization parameter λ > 0 balances the modal regression measure and the hypothesis space complexity.
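The pieces of the RMR objective can be written down in a few lines. The sketch below is our illustration (function names and demo data are ours, not the authors' code), assuming a Gaussian representing function: it evaluates the empirical modal risk R_z^σ and the regularized objective for a candidate weight vector.

```python
import numpy as np

def phi_gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def empirical_modal_risk(w, X, y, sigma):
    """R_z^sigma(f) = (1/(n*sigma)) * sum_i phi((y_i - w^T x_i) / sigma)."""
    residuals = y - X @ w
    return phi_gaussian(residuals / sigma).mean() / sigma

def rmr_objective(w, X, y, sigma, lam, q=1.0, tau=None):
    """Regularized objective: empirical modal risk minus lambda * Omega(f)."""
    tau = np.ones_like(w) if tau is None else tau
    return empirical_modal_risk(w, X, y, sigma) - lam * np.sum(tau * np.abs(w)**q)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=200)
good = rmr_objective(w_true, X, y, sigma=0.5, lam=0.01)  # small residuals, high density at 0
bad = rmr_objective(np.zeros(3), X, y, sigma=0.5, lam=0.01)
```

A weight vector whose residuals concentrate near zero scores a higher objective value, which is what RMR maximizes.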
It is easy to deduce that f_z(x) = w_z^T x with

w_z = argmax_{w ∈ R^p} { (1/(nσ)) Σ_{i=1}^n φ((y_i − w^T x_i)/σ) − λ Σ_{j=1}^p τ_j |w_j|^q }.

When τ_j ≡ 1 for 1 ≤ j ≤ p and q = 1, (5) can be considered a natural extension of the Lasso [22] from learning the conditional mean function to estimating the conditional mode function. When τ_j ≡ 1 for 1 ≤ j ≤ p and q = 2, (5) can also be regarded as the corresponding version of ridge regression, obtained by replacing the MSE criterion with the modal regression criterion. In particular, when K_σ is the Gaussian kernel and τ_j ≡ 1 for 1 ≤ j ≤ p, (5) can be rewritten as:

w_z = argmax_{w ∈ R^p} { (1/(nσ)) Σ_{i=1}^n exp(−(y_i − w^T x_i)²/σ²) − λ ‖w‖_q^q },

which is equivalent to correntropy regression under the maximum correntropy criterion [19, 9, 7].

2.3 Optimization algorithm

We employ half-quadratic (HQ) theory [18] in the optimization. A convex problem min_s u(s) is equivalent to the half-quadratic reformulation:

min_{s,t} Q(s, t) + v(t),

where Q(s, t) is quadratic in s for any t ∈ R and v : R → R satisfies:

u(s) = min_t Q(s, t) + v(t), ∀s ∈ R.

Such a dual potential function v can be determined via convex conjugacy, as shown below. According to convex optimization theory [20], for a closed convex function f(a) there exists a convex function g(b) such that:

f(a) = max_b (ab − g(b)),

where g is the conjugate of f, i.e., g = f*.
Symmetrically, it is easy to prove that f = g*.

Theorem 1 For a closed convex function f(a) = max_b (ab − g(b)), we have argmax_b (ab − g(b)) = f'(a) for any a ∈ R.

When K_σ is the Gaussian kernel, the optimization steps can be found in [9]. Here we take the Epanechnikov kernel (a.k.a. parabolic kernel) as an example to show the optimization of Problem (5) via HQ theory. The kernel-induced representing function of the Epanechnikov kernel is φ(e) = (3/4)(1 − e²) 1_{[|e| ≤ 1]}. Define a closed convex function f as:

f(a) = (3/4)(1 − a) for 0 ≤ a ≤ 1, and f(a) = 0 for a ≥ 1.

There exists a convex function g such that f(a) = max_b (ab − g(b)) and φ(e) = f(e²) = max_b (e²b − g(b)). Thus, when τ_j ≡ 1 for 1 ≤ j ≤ p, the optimization problem (5) can be rewritten as:

max_{w ∈ R^p, b ∈ R^n} { (1/(nσ)) Σ_{i=1}^n ( b_i ((y_i − w^T x_i)/σ)² − g(b_i) ) − λ Σ_{j=1}^p τ_j |w_j|^q }.    (6)

Problem (6) can be easily optimized via an alternating optimization algorithm. Note that, according to Theorem 1, when w is fixed, b can be updated as b_i = f'(((y_i − w^T x_i)/σ)²) = −(3/4) 1_{[|(y_i − w^T x_i)/σ| ≤ 1]} for i = 1, 2, ..., n. Due to space limitations, we provide the proof of Theorem 1 and the optimization steps of RMR in the supplementary material.

3 Learning Theory Analysis

This section presents the theoretical foundations of RMR in terms of approximation ability, variable sparsity, and algorithmic robustness.
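Before turning to the theory, the alternating HQ scheme of Section 2.3 can be sketched as follows. This is our illustrative implementation, not the authors' released code: it uses the Epanechnikov kernel with q = 2, so the w-step becomes a closed-form weighted ridge update, and it initializes from ordinary least squares.

```python
import numpy as np

def rmr_hq_epanechnikov(X, y, sigma, lam, n_iter=30):
    """Alternating HQ scheme for RMR with the Epanechnikov kernel and q = 2.

    b-step (Theorem 1): b_i = -(3/4) * 1[|y_i - w^T x_i| <= sigma];
    w-step: with b fixed, (6) reduces to a weighted ridge problem solved in closed form.
    """
    n, p = X.shape
    w = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares initialization (our choice)
    for _ in range(n_iter):
        r = (y - X @ w) / sigma
        a = 0.75 * (np.abs(r) <= 1.0)         # a_i = -b_i >= 0 acts as a sample weight
        H = (X.T * a) @ X / (n * sigma**3) + lam * np.eye(p)
        w = np.linalg.solve(H, (X.T * a) @ y / (n * sigma**3))
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=300)
y[:30] += 15.0                                # 10% gross outliers
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
w_hq = rmr_hq_epanechnikov(X, y, sigma=2.0, lam=1e-4)
```

With a moderate bandwidth, points whose residuals exceed σ receive zero weight in the w-step, which is what makes the estimator insensitive to gross outliers.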
Detailed proofs of these results can be found in the supplementary material.

3.1 Approximation ability analysis

Besides the linearity requirement on the conditional mode function, we also need some basic conditions on the kernel-induced representing function φ [6, 28].

Assumption 1 The representing function φ satisfies the following conditions: 1) φ(u) ≤ φ(0) < ∞ for all u ∈ R; 2) φ is Lipschitz continuous with constant L_φ; 3) ∫_R φ(u) du = 1 and ∫_R u² φ(u) du < ∞.

It is easy to verify that most kernels used for density estimation satisfy the above conditions, e.g., the Gaussian kernel, Epanechnikov kernel, quadratic kernel, etc. Since RMR is associated with R_z^σ(f), we need to establish a quantitative relationship between R^σ(f) and R(f). Recently, modal regression calibration has been illustrated in Theorem 10 of [6] under the following restrictions on the conditional density p_{ε|X}.

Assumption 2 The conditional density p_{ε|X} is second-order continuously differentiable and uniformly bounded.

Now we present the approximation bound on R(f*) − R(f_z).

Theorem 2 Let ‖x‖_{q/(q−1)} ≤ a for any x ∈ X and f* ∈ F. Under Assumptions 1-2, for q ∈ (1, 2], taking λ = σ² = O(n^{−q/(4q+3)}), we have:

R(f*) − R(f_z) ≤ C log(4/δ) n^{−q/(4q+3)}

with confidence at least 1 − δ.
In particular, for q = 1 and ‖x‖_∞ ≤ a, choosing λ = σ² = (ln p / n)^{1/7}, we have:

R(f*) − R(f_z) ≤ C log(4/δ) (ln p / n)^{1/7}

with confidence at least 1 − δ. Here C is a constant independent of n and δ.

Theorem 2 shows that the excess risk R(f*) − R(f_z) → 0 with polynomial decay, and estimation consistency is guaranteed as n → ∞. Moreover, under Assumption 3 in [6], we can derive that f_z tends to f* with approximation order O(n^{−q/(4q+3)}) for q ∈ (1, 2] and O((ln p / n)^{1/7}) for q = 1. Although approximation analysis has been provided for modal regression in [6, 28], both results are limited to empirical risk minimization. This differs from our result for regularized modal regression under structural risk minimization.

3.2 Sparsity analysis

To characterize the variable selection ability of RMR, we first present a property of the nonzero components of w_z.

Theorem 3 Assume that φ is differentiable at any t ∈ R. For j ∈ {1, 2, ..., p} satisfying w_zj ≠ 0, there holds:

| (1/(nσ²)) Σ_{i=1}^n φ'((y_i − f_z(x_i))/σ) x_ij | = q λ τ_j |w_zj|^{q−1} / 2.

Observe that the condition on φ holds for the Gaussian kernel, sigmoid function, and logistic function. Theorem 3 gives a necessary condition for w_zj to be nonzero. Without loss of generality, we set S_0 = {1, 2, ..., p_0} as the index set of the truly informative variables and denote S_z = {j : w_zj ≠ 0} as the set of informative variables identified by RMR in (5).

Theorem 4 Assume that ‖x‖_∞ ≤ a for any x ∈ X and λτ_j ≥ ‖φ'‖_∞ σ for any j > p_0.
Then, for RMR (5) with q = 1, there holds S_z ⊂ S_0 for all z ∈ (X × Y)^n.

Theorem 4 assures that RMR has the capacity to identify the truly informative variables in theory. Combining Theorem 4 and Theorem 2, we obtain the asymptotic theory of RMR for estimation and model selection.

3.3 Robustness analysis

To quantify the robustness of RMR, we calculate its finite sample breakdown point, which reflects the largest fraction of contaminated points that an estimator can tolerate before returning arbitrary values [11, 12]. Recently, this index has been used to investigate the robustness of modal linear regression [28] and kernel-based modal regression [6].

Recall that the derived weight w_z defined in (5) depends on the given sample set z = {(x_i, y_i)}_{i=1}^n. By adding m arbitrary points z' = {(x_{n+j}, y_{n+j})}_{j=1}^m ⊂ X × Y, we obtain the corrupted sample set z ∪ z'. For given λ, σ, {τ_j}_{j=1}^p, let w_{z∪z'} be the maximizer of (5). Then the finite sample breakdown point of w_z is defined as:

ε(w_z) = min_{1 ≤ m ≤ n} { m/(n + m) : sup_{z'} ‖w_{z∪z'}‖_2 = ∞ }.

Theorem 5 Assume that φ(u) = φ(−u) and φ(t) → 0 as t → ∞.
For given λ, σ, {τ_j}_{j=1}^p, denote:

M = (1/φ(0)) Σ_{i=1}^n φ((y_i − f_z(x_i))/σ) − λσ (φ(0))^{−1} Ω(f_z).

Then the finite sample breakdown point of w_z in (5) is ε(w_z) = m*/(n + m*), where m* ≥ ⌈M⌉ and ⌈M⌉ is the smallest integer not less than M.

From Theorem 5, we know that the finite sample breakdown point of RMR depends on φ, σ, and the sample configuration, which is similar to the re-descending M-estimator and to the recent analysis of modal linear regression in [28]. As illustrated in [11, 12], the finite sample breakdown point is high when the bandwidth σ depends only on the training samples. Hence, RMR can achieve satisfactory robustness when λ and τ_j are chosen properly and σ is determined by data-driven techniques.

4 Experimental Analysis

In this section, we conduct experiments on toy data, benchmark data, and the ADNI cohort data to evaluate our RMR model. We compare several regression methods in the experiments, including: LSR (traditional mean regression based on the least squares estimator), LSR-L2 (LSR with squared ℓ2-norm regularization, i.e., ridge regression), LSR-L1 (LSR with ℓ1-norm regularization), MedianR (median regression), HuberR (regression with the Huber loss), RMR-L2 (RMR with squared ℓ2-norm regularization), and RMR-L1 (RMR with ℓ1-norm regularization).

For evaluation, we calculate the root mean square error (RMSE) between the predicted values and the ground truth in out-of-sample prediction. The RMSE value is normalized by the Frobenius norm of the ground truth matrix. We employ 2-fold cross validation and report the average performance for each method. For each method, we set the hyper-parameter of the regularization term in the range {10^{−4}, 10^{−3.5}, ..., 10^4}.
We tune the hyper-parameters via 2-fold cross validation on the training data and report the best parameter w.r.t. RMSE for each method. For the RMR methods, we adopt the Epanechnikov kernel and set the bandwidth as σ = max(|y − w^T x|).

4.1 Performance comparison on toy data

Following the design in [28], we generate the toy data by sampling i.i.d. from the model Y = −2 + 3X + σ(X)ε, where X ~ U(0, 1), σ(X) = 1 + 2X, and ε ~ 0.5N(−2, 3²) + 0.5N(2, 1²). We can derive that E(ε) = 0, Mode(ε) = 1.94, and Median(ε) = 1; hence the conditional mean function of the toy data is E(Y|X) = −2 + 3X, the conditional median function is Median(Y|X) = −1 + 5X, and the conditional mode function is Mode(Y|X) = −0.06 + 6.88X.

We consider three sample sizes, 100, 200, and 500, and repeat the experiments 100 times for each setting. We present the RMSE in Table 1, which shows that the RMR models achieve lower RMSE values than all competing methods. This indicates that the RMR models estimate the output better when the noise in the data is skewed and relatively heavy-tailed. Moreover, we compare the coverage probabilities of prediction intervals centered around the predicted value from each method. We set the lengths of the coverage intervals to {0.1ν, 0.2ν, 0.3ν}, with ν = 3 being the approximate standard error of ε. From Table 2 we find that the RMR models provide larger coverage probabilities than their counterparts.

4.2 Performance comparison on benchmark data

Here we present comparison results on six benchmark datasets from the UCI repository [15] and StatLib2, which include: slumptest, forestfire, bolts, cloud, kidney, and lupus. We summarize the results in Table 3. From the comparison we notice that the RMR models tend to perform better on all datasets.
Also, RMR-L1 obtains lower RMSE values, since the ℓ1-norm regularization makes the RMR-L1 model more robust.

2 http://lib.stat.cmu.edu/datasets/

Table 1: Average RMSE and standard deviation with different numbers (n) of toy samples.

Method    n=100            n=200            n=500
LSR       0.9687±0.0699    0.9477±0.0294    0.9495±0.0114
LSR-L2    0.9671±0.0685    0.9469±0.0284    0.9495±0.0114
LSR-L1    0.9672±0.0685    0.9473±0.0288    0.9495±0.0114
MedianR   0.9944±0.0806    0.9568±0.0350    0.9542±0.0120
HuberR    0.9725±0.0681    0.9485±0.0296    0.9502±0.0116
RMR-L2    0.9663±0.0683    0.9466±0.0282    0.9493±0.0114
RMR-L1    0.9662±0.0679    0.9465±0.0281    0.9492±0.0114

Table 2: Average coverage probabilities and standard deviation on toy data.

Interval length 0.1ν:
Method    n=100            n=200            n=500
LSR       0.0730±0.0247    0.0702±0.0166    0.0702±0.0106
LSR-L2    0.0753±0.0247    0.0731±0.0155    0.0709±0.0108
LSR-L1    0.0747±0.0246    0.0719±0.0161    0.0706±0.0106
MedianR   0.0563±0.0255    0.0626±0.0124    0.0654±0.0097
HuberR    0.0710±0.0258    0.0698±0.0160    0.0694±0.0101
RMR-L2    0.0760±0.0254    0.0740±0.0161    0.0719±0.0111
RMR-L1    0.0760±0.0255    0.0742±0.0156    0.0720±0.0111

Interval length 0.2ν:
Method    n=100            n=200            n=500
LSR       0.1313±0.0338    0.1450±0.0255    0.1430±0.0193
LSR-L2    0.1337±0.0334    0.1461±0.0251    0.1429±0.0196
LSR-L1    0.1337±0.0337    0.1458±0.0258    0.1430±0.0193
MedianR   0.1087±0.0351    0.1331±0.0239    0.1377±0.0182
HuberR    0.1237±0.0347    0.1442±0.0257    0.1421±0.0188
RMR-L2    0.1340±0.0336    0.1477±0.0256    0.1441±0.0199
RMR-L1    0.1343±0.0340    0.1481±0.0247    0.1441±0.0198

Interval length 0.3ν:
Method    n=100            n=200            n=500
LSR       0.1923±0.0402    0.2142±0.0342    0.2150±0.0229
LSR-L2    0.1940±0.0415    0.2165±0.0331    0.2156±0.0222
LSR-L1    0.1940±0.0415    0.2153±0.0334    0.2153±0.0226
MedianR   0.1750±0.0414    0.2031±0.0299    0.2095±0.0233
HuberR    0.1873±0.0389    0.2132±0.0333    0.2144±0.0224
RMR-L2    0.1943±0.0420    0.2179±0.0327    0.2168±0.0220
RMR-L1    0.1950±0.0406    0.2177±0.0323    0.2167±0.0219

Table 3: Average RMSE and standard deviation on benchmark data.

Method    cloud            kidney           lupus            slumptest        forestfire       bolts
LSR       0.2689±0.0295    0.9986±0.0874    0.4865±0.0607    0.6178±0.0190    0.5077±0.0264    0.8646±0.3703
LSR-L2    0.2616±0.0266    0.9822±0.0064    0.4687±0.0137    0.5782±0.0029    0.5106±0.0219    0.8338±0.3282
LSR-L1    0.2571±0.0277    0.9822±0.0079    0.4713±0.0172    0.5802±0.0043    0.5196±0.0089    0.8408±0.3366
MedianR   0.2810±0.0024    0.9964±0.0050    0.4436±0.0232    0.6457±0.0301    0.5432±0.0160    1.2274±0.6979
HuberR    0.2669±0.0268    0.9874±0.0299    0.4841±0.0661    0.6178±0.0190    0.5447±0.0270    0.9198±0.4226
RMR-L2    0.2538±0.0185    0.9817±0.0093    0.4782±0.0107    0.5702±0.0131    0.4871±0.0578    0.8071±0.3053
RMR-L1    0.2517±0.0240    0.9802±0.0198    0.3298±0.1313    0.5663±0.0305    0.4989±0.0398    0.7885±0.2910

Table 4: Average RMSE and standard deviation on the ADNI data.

Method    Fluency          ADAS             TRAILS
LSR       0.3856±0.0034    0.4397±0.0112    0.6798±0.0538
LSR-L2    0.3269±0.0069    0.4116±0.0208    0.5443±0.0127
LSR-L1    0.3295±0.0035    0.4121±0.0100    0.5476±0.0115
MedianR   0.4164±0.0291    0.4700±0.0151    0.6702±0.1184
HuberR    0.3856±0.0034    0.4383±0.0133    0.6621±0.0789
RMR-L2    0.3256±0.0049    0.4105±0.0216    0.5342±0.0186
RMR-L1    0.3269±0.0057    0.4029±0.0234    0.5423±0.0123

4.3 Performance comparison on the ADNI cohort data

Now we look
into a practical problem in Alzheimer's disease, i.e., the prediction of cognitive scores from neuroimaging features. Data used in this article were obtained from the ADNI database (adni.loni.usc.edu). We extract 93 regions of interest (ROIs) as neuroimaging features and use cognitive scores from three tests: the Fluency Test, the Alzheimer's Disease Assessment Scale (ADAS), and the Trail Making Test (TRAILS). 795 subjects were involved in our study, including 180 AD samples, 390 MCI samples, and 225 normal control (NC) samples. A detailed data description can be found in the supplementary material.

Our goal is to construct an appropriate model to predict cognitive performance given neuroimaging data. Meanwhile, we expect the model to indicate the importance of different features in the prediction, which is fundamental to understanding the role of each imaging marker in the study of AD. From Table 4, we find that the RMR models always perform equal to or better than the competing methods, which verifies that RMR is more appropriate for learning the association between neuroimaging markers and cognitive performance. We notice that RMR-L2 always performs better than LSR-L2, and RMR-L1 outperforms LSR-L1. This is because the symmetric noise assumption of least squares models may not hold for the ADNI cohort. Compared with HuberR, our RMR model is shown to be less sensitive to outliers. Moreover, from the comparison between MedianR and the RMR models, we can infer that the conditional mode is more suitable than the conditional median for the prediction of cognitive scores.

RMR-L1 imposes sparse constraints on the learnt weight matrix, which naturally achieves the goal of feature selection in the association study. Here we take the TRAILS cognitive assessment as an example and look into the important neuroimaging features in the prediction. From the heat map and brain maps in Figs. 1 and 2, we obtain several interesting findings.
In the prediction, temporal lobe white matter is picked out as a predominant feature. [10, 2] reported decreased fractional anisotropy (FA) and increased radial diffusivity (DR) in the white matter of the temporal lobe among AD and Mild Cognitive Impairment (MCI) subjects. [10] also revealed a correlation between temporal lobe FA and episodic memory, which may account for the influence of the temporal lobe on TMT results. Besides, there is evidence in [17] supporting an association between the left temporal lobe and the working memory component involving letters and numbers in the TMT. Moreover, the angular gyrus shows high correlation with TRAILS scores in our analysis. Previous research has revealed that the angular gyrus shares many clinical features with AD. [14] presented structural MRI findings showing greater left angular gyrus atrophy in MCI converters than in non-converters, which points to the role of atrophy of structures like the angular gyrus in the progression of dementia. [3] showed evidence for the role of the angular gyrus in orienting spatial attention, which serves as a key factor in TMT results. The replication of these results supports the effectiveness of our model.

5 Conclusion

This paper proposes a new regularized modal regression method and establishes its theoretical foundations regarding approximation ability, sparsity, and robustness. These characterizations fill in the theoretical gaps for modal regression under Tikhonov regularization. Empirical results verify the competitive performance of the proposed approach on simulated data, benchmark data, and real biomedical data.

Figure 1: Heatmap showing the weights of each neuroimaging feature in the RMR-L1 model for the prediction of TRAILS cognitive measures. We draw two matrices, where the upper figure is for the left hemisphere and the lower figure is for the right hemisphere.
Imaging markers (columns) with larger weights indicate higher correlation with the corresponding cognitive measure in the prediction.

Figure 2: Cortical maps of the ROIs identified by the RMR-L1 model for the prediction of TRAILS cognitive measures. The brain maps show one slice of each view. The three maps correspond to the three different cognitive measures in the TRAILS cognitive test, respectively.

With the sparsity property of our model, we identified several biologically meaningful neuroimaging markers, showing the potential to enhance the understanding of the onset and progression of AD.

Acknowledgments

This work was partially supported by U.S. NSF-IIS 1302675, NSF-IIS 1344152, NSF-DBI 1356628, NSF-IIS 1619308, NSF-IIS 1633753, and NIH AG049371. Hong Chen was partially supported by the National Natural Science Foundation of China (NSFC) 11671161. We are grateful to the anonymous NIPS reviewers for their insightful comments.

References

[1] S. G. Armitage. An analysis of certain psychological tests used for the evaluation of brain injury. Psychol. Monogr., 60(1):1–48, 1946.

[2] M. Bozzali, A. Falini, M. Franceschi, M. Cercignani, M. Zuffi, G. Scotti, G. Comi, and M. Filippi. White matter damage in Alzheimer's disease assessed in vivo using diffusion tensor magnetic resonance imaging. J. Neurol. Neurosurg. Psychiatry, 72(6):742–746, 2002.

[3] C. D. Chambers, J. M. Payne, M. G. Stokes, and J. B. Mattingley. Fast and slow parietal pathways mediate spatial attention. Nat. Neurosci., 7(3):217–218, 2004.

[4] Y.-C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman. Nonparametric modal regression. Ann. Statist., 44(2):489–514, 2016.

[5] G. Collomb, W. Härdle, and S. Hassani. A note on prediction via estimation of the conditional mode function. J. Stat. Plan. Infer., 15:227–236, 1987.

[6] Y. Feng, J. Fan, and J. A. Suykens.
A statistical learning approach to modal regression. arXiv:1702.05960, 2017.

[7] Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. Suykens. Learning with the maximum correntropy criterion induced losses for regression. J. Mach. Learn. Res., 16:993–1034, 2015.

[8] M. F. Folstein, S. E. Folstein, and P. R. McHugh. "Mini-mental state": a practical method for grading the cognitive state of patients for the clinician. J. Psychiatr. Res., 12(3):189–198, 1975.

[9] R. He, W.-S. Zheng, and B.-G. Hu. Maximum correntropy criterion for robust face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1561–1576, 2011.

[10] J. Huang, R. Friedland, and A. Auchus. Diffusion tensor imaging of normal-appearing white matter in mild cognitive impairment and early Alzheimer disease: preliminary evidence of axonal degeneration in the temporal lobe. AJNR Am. J. Neuroradiol., 28(10):1943–1948, 2007.

[11] P. J. Huber. Robust Statistics. John Wiley & Sons, 1981.

[12] P. J. Huber. Finite sample breakdown of M- and P-estimators. Ann. Statist., 12(1):119–126, 1984.

[13] C. R. Jack, M. A. Bernstein, N. C. Fox, P. Thompson, G. Alexander, D. Harvey, B. Borowski, P. J. Britson, J. L. Whitwell, C. Ward, et al. The Alzheimer's disease neuroimaging initiative (ADNI): MRI methods. J. Magn. Reson. Imaging, 27(4):685–691, 2008.

[14] G. Karas, J. Sluimer, R. Goekoop, W. Van Der Flier, S. Rombouts, H. Vrenken, P. Scheltens, N. Fox, and F. Barkhof. Amnestic mild cognitive impairment: structural MR imaging findings predictive of conversion to Alzheimer disease. AJNR Am. J. Neuroradiol., 29(5):944–949, 2008.

[15] M. Lichman. UCI machine learning repository, 2013.

[16] E. Moradi, I. Hallikainen, T. Hänninen, J. Tohka, A. D. N. Initiative, et al. Rey's auditory verbal learning test scores can be predicted from whole brain MRI in Alzheimer's disease.
Neuroimage Clin., 13:415–427, 2017.

[17] J. Nickel, H. Jokeit, G. Wunderlich, A. Ebner, O. W. Witte, and R. J. Seitz. Gender-specific differences of hypometabolism in mTLE: Implication for cognitive impairments. Epilepsia, 44(12):1551–1561, 2003.

[18] M. Nikolova and M. K. Ng. Analysis of half-quadratic minimization methods for signal and image recovery. SIAM J. Sci. Comput., 27(3):937–966, 2005.

[19] J. C. Principe. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. Springer, New York, 2010.

[20] R. T. Rockafellar. Convex Analysis. Princeton, NJ, USA: Princeton Univ. Press, 1970.

[21] T. W. Sager and R. A. Thisted. Maximum likelihood estimation of isotonic modal regression. Ann. Statist., 10(3):690–707, 1982.

[22] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58(1):267–288, 1996.

[23] H. Wang, F. Nie, H. Huang, S. Risacher, C. Ding, A. J. Saykin, L. Shen, et al. Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 557–562. IEEE, 2011.

[24] H. Wang, F. Nie, H. Huang, S. Risacher, A. J. Saykin, and L. Shen. Joint classification and regression for identifying AD-sensitive and cognition-relevant imaging biomarkers. The 14th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2011), pages 115–123.

[25] H. Wang, F. Nie, H. Huang, J. Yan, S. Kim, S. Risacher, A. Saykin, and L. Shen. High-order multi-task feature learning to identify longitudinal phenotypic markers for Alzheimer disease progression prediction. Neural Information Processing Systems Conference (NIPS 2012), pages 1286–1294.

[26] X. Wang, D. Shen, and H. Huang. Prediction of memory impairment with MRI data: A longitudinal study of Alzheimer's disease.
19th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2016), pages 273–281.

[27] H. Yang and J. Yang. A robust and efficient estimation and variable selection method for partially linear single-index models. J. Multivariate Anal., 129:227–242, 2014.

[28] W. Yao and L. Li. A new regression model: modal linear regression. Scandinavian J. Statistics, 41(3):656–671, 2014.

[29] W. Yao, B. G. Lindsay, and R. Li. Local modal regression. J. Nonparametric Stat., 24(3):647–663, 2012.

[30] W. Zhao, R. Zhang, J. Liu, and Y. Lv. Robust and efficient variable selection for semiparametric partially linear varying coefficient model based on modal regression. Ann. I. Stat. Math., 66(1):165–191, 2014.