{"title": "On Ranking in Survival Analysis: Bounds on the Concordance Index", "book": "Advances in Neural Information Processing Systems", "page_first": 1209, "page_last": 1216, "abstract": "In this paper, we show that classical survival analysis involving censored data can naturally be cast as a ranking problem. The concordance index (CI), which quantifies the quality of rankings, is the standard performance measure for model \\emph{assessment} in survival analysis. In contrast, the standard approach to \\emph{learning} the popular proportional hazard (PH) model is based on Cox's partial likelihood. In this paper we devise two bounds on CI--one of which emerges directly from the properties of PH models--and optimize them \\emph{directly}. Our experimental results suggest that both methods perform about equally well, with our new approach giving slightly better results than the Cox's method. We also explain why a method designed to maximize the Cox's partial likelihood also ends up (approximately) maximizing the CI.", "full_text": "On Ranking in Survival Analysis: Bounds on the\n\nConcordance Index\n\nVikas C. Raykar, Harald Steck, Balaji Krishnapuram\n\nCAD and Knowledge Solutions (IKM CKS), Siemens Medical Solutions Inc., Malvern, USA\n\n{vikas.raykar,harald.steck,balaji.krishnapuram}@siemens.com\n\nMaastro Clinic, University Hospital Maastricht, University Maastricht, GROW, The Netherlands\n\n{cary.dehing,philippe.lambin}@maastro.nl\n\nCary Dehing-Oberije, Philippe Lambin\n\nAbstract\n\nIn this paper, we show that classical survival analysis involving censored data\ncan naturally be cast as a ranking problem. The concordance index (CI), which\nquanti\ufb01es the quality of rankings, is the standard performance measure for model\nassessment in survival analysis. In contrast, the standard approach to learning the\npopular proportional hazard (PH) model is based on Cox\u2019s partial likelihood. 
We devise two bounds on the CI, one of which emerges directly from the properties of PH models, and optimize them directly. Our experimental results suggest that all three methods perform about equally well, with our new approach giving slightly better results. We also explain why a method designed to maximize Cox's partial likelihood also ends up (approximately) maximizing the CI.

1 Introduction

Survival analysis is a well-established field in medical statistics concerned with analyzing and predicting the time until the occurrence of an event of interest, e.g., death, onset of a disease, or failure of a machine. It is applied not only in clinical research, but also in epidemiology, reliability engineering, marketing, insurance, etc. The time between a well-defined starting point and the occurrence of the event is called the survival time or failure time, measured in clock time or in another appropriate scale, e.g., mileage of a car. Survival time data are not amenable to standard statistical methods because of two special features: (1) the continuous survival time often follows a skewed distribution, far from normal, and (2) a large portion of the data is censored (see Sec. 2). In this paper we take a machine learning perspective and cast survival analysis as a ranking problem, where the task is to rank the data points based on their survival times rather than to predict the actual survival times. One of the most popular performance measures for assessing learned models in survival analysis is the concordance index (CI), which is similar to the Wilcoxon-Mann-Whitney statistic [13, 10] used in bipartite ranking problems.

Given the CI as a performance measure, we develop approaches that learn models by directly optimizing the CI.
As optimization of the CI is computationally expensive, we focus on maximizing two lower bounds on the CI, namely the log-sigmoid and the exponential bounds, which are described in Secs. 4, 5, and 6. Interestingly, the log-sigmoid bound arises in a natural way from the proportional hazard (PH) model, which is the standard model used in classical survival analysis, see Sec. 5.2. Moreover, as PH models are learned by optimizing Cox's partial likelihood in classical survival analysis, we show in Sec. 8 that maximizing this likelihood also ends up (approximately) maximizing the CI. Our experiments in Sec. 9 show that optimizing our two lower bounds and Cox's likelihood yields very similar results with respect to the CI, with the proposed lower bounds being slightly better.

2 Survival analysis

Survival analysis has been extensively studied in the statistics community for decades, e.g., [4, 8]. A primary focus is to build statistical models for the survival time T*_i of individual i of a population.

2.1 Censored data

A major problem is the fact that the period of observation C*_i can be censored for many individuals i. For instance, a patient may move to a different town and thus be no longer available for a clinical trial. Also, at the end of the trial many patients may actually survive. For such cases the exact survival time may be longer than the observation period. Such data are referred to as right-censored, and C*_i is also called the censoring time. For such individuals, we only know that they survived for at least C*_i, i.e., our actual observation is Ti = min(T*_i, C*_i).

Let xi ∈ R^d be the associated d-dimensional vector of covariates (explanatory variables) for the ith individual. In clinical studies, the covariates typically include demographic variables, such as age, gender, or race; diagnosis information like lab tests; or treatment information, e.g., dosage.
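For concreteness, this right-censored data representation can be sketched in a few lines of Python. This is a synthetic illustration only; the variable names and the exponential survival/censoring distributions are our own assumptions, not the paper's data:

```python
import random

random.seed(0)

# Toy right-censored data in the form D = {(T_i, x_i, delta_i)}: T* is the
# latent true survival time, C* the censoring time; we observe
# T = min(T*, C*) together with the event indicator delta, which equals 1
# if the failure was actually observed (T* <= C*) and 0 if censored.
N, d = 5, 3
data = []
for i in range(N):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]   # covariates x_i
    t_true = random.expovariate(1.0)                 # latent survival time T*_i
    c = random.expovariate(0.5)                      # independent censoring time C*_i
    t_obs = min(t_true, c)
    delta = 1 if t_true <= c else 0
    data.append((t_obs, x, delta))

n_censored = sum(1 for (_, _, delta) in data if delta == 0)
```

The learner only ever sees `data`; the latent `t_true` and `c` are discarded, which is exactly what makes censored observations only partially informative.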
An important assumption generally made is that T*_i and C*_i are independent conditional on xi, i.e., the cause for censoring is independent of the survival time. With the indicator function δi, which equals 1 if failure is observed (T*_i ≤ C*_i) and 0 if the data is censored (T*_i > C*_i), the available training data can be summarized as D = {Ti, xi, δi}, i = 1, ..., N, for N patients. The objective is to learn a predictive model for the survival time as a function of the covariates.

2.2 Failure time distributions

The failure times are typically modeled to follow a distribution, which absorbs both truly random effects and causes unexplained by the (available) covariates. This distribution is characterized by the survival function S(t) = Pr[T > t] for t > 0, which is the probability that the individual is still alive at time t. A related function commonly used is the hazard function. If T has density function p, then the hazard function is defined by λ(t) = lim_{Δt→0} Pr[t < T ≤ t + Δt | T > t]/Δt = p(t)/S(t). The hazard function measures the instantaneous rate of failure, and provides more insight into the failure mechanisms. The function Λ(t) = ∫_0^t λ(u) du is called the cumulative hazard function, and it holds that S(t) = e^{−Λ(t)} [4].

2.3 Proportional hazard model

Proportional hazard (PH) models have become the standard for studying the effect of the covariates on the survival time distributions, e.g., [8].
Specifically, the PH model assumes a multiplicative effect of the covariates on the hazard function, i.e.,

λ(t|x) = λ0(t) e^{w⊤x},   (1)

where λ(t|x) is the hazard function of a person with covariates x; λ0(t) is the so-called baseline hazard function (i.e., when x = 0), which is typically based on the exponential or the Weibull distributions; w is a set of unknown regression parameters, and e^{w⊤x} is the relative hazard function. Equivalent formulations for the cumulative hazard function and the survival function include

Λ(t|x) = Λ0(t) e^{w⊤x},   and   S(t|x) = e^{−Λ0(t) e^{w⊤x}} = exp[ −e^{w⊤x} ∫ λ0(t) dt ].   (2)

2.4 Cox's partial likelihood

Cox noticed that a semi-parametric approach is sufficient for estimating the weights w in PH models [2, 3], i.e., the baseline hazard function can remain completely unspecified. Only a parametric assumption concerning the effect of the covariates on the hazard function is required. Parameter estimates in the PH model are obtained by maximizing Cox's partial likelihood (of the weights) [2, 3]:

L(w) = Π_{Ti uncensored} [ e^{w⊤xi} / Σ_{Tj ≥ Ti} e^{w⊤xj} ].   (3)

Figure 1: Order graphs representing the ranking constraints. (a) No censored data and (b) with censored data. The empty circle represents a censored point. The points are arranged in increasing value of their survival times, with the lowest being at the bottom. (c) Two concave lower bounds on the 0-1 indicator function.

Each term in the product is the probability that the ith individual failed at time Ti, given that exactly one failure has occurred at time Ti and all individuals for which Tj ≥ Ti are at risk of failing.
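Eq. 3 can be made concrete with a short Python sketch; the toy data and names below are invented for illustration and are not part of the paper:

```python
import math

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def cox_partial_likelihood(w, times, xs, deltas):
    # Product over uncensored observations of e^{w'x_i} / sum over the
    # risk set {j : T_j >= T_i} of e^{w'x_j}, as in Eq. 3.
    L = 1.0
    for i, (ti, di) in enumerate(zip(times, deltas)):
        if di == 0:          # censored observations contribute no factor
            continue
        risk = sum(math.exp(dot(w, xs[j]))
                   for j, tj in enumerate(times) if tj >= ti)
        L *= math.exp(dot(w, xs[i])) / risk
    return L

times  = [2.0, 3.5, 1.0, 4.0]
deltas = [1, 1, 1, 0]        # the last observation is censored
xs     = [[0.5], [-1.0], [1.5], [0.0]]
w      = [0.8]
L = cox_partial_likelihood(w, times, xs, deltas)
```

Since each factor's risk set contains the failing individual itself, every factor lies in (0, 1], and so does the product.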
Cox and others have shown that this partial log-likelihood can be treated as an ordinary log-likelihood to derive valid (partial) maximum likelihood estimates of w [2, 3]. The interesting properties of Cox's partial likelihood include: (1) due to its parametric form, it can be optimized in a computationally efficient way; (2) it depends only on the ranks of the observed survival times, cf. the inequality Tj ≥ Ti in Eq. 3, rather than on their actual numerical values. We outline this connection to the ranking of the times Ti, and hence the concordance index, in Sec. 8.

3 Ordering of survival times

Casting survival analysis as a ranking problem is an elegant way of dealing not only with the typically skewed distributions of survival times, but also with the censoring of the data: two subjects' survival times can be ordered not only if (1) both of them are uncensored but also if (2) the uncensored time of one is smaller than the censored survival time of the other. This can be visualized by means of an order graph G = (V, E), cf. also Fig. 1. The set of vertices V represents all the individuals, where each filled vertex indicates an observed/uncensored survival time, while an empty circle denotes a censored observation. Existence of an edge Eij implies that Ti < Tj. An edge cannot originate from a censored point.

3.1 Concordance index

For these reasons, the concordance index (CI) or c-index is one of the most commonly used performance measures of survival models, e.g., [6]. It can be interpreted as the fraction of all pairs of subjects whose predicted survival times are correctly ordered among all subjects that can actually be ordered. In other words, it is the probability of concordance between the predicted and the observed survival. It can be written as

c(D, G, f) = (1/|E|) Σ_{Eij} 1_{f(xi) < f(xj)}   (4)
           = (1/|E|) Σ_{Ti uncensored} Σ_{Tj > Ti} 1_{f(xi) < f(xj)},   (5)

where |E| is the number of edges in the order graph, and f(xi) > f(xj) implies that the survival time of patient i is larger than the one of patient j.
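The concordance index over the order graph can be sketched directly from this definition; the toy scores and times below are invented for illustration:

```python
def concordance_index(f_values, times, deltas):
    # Fraction of order-graph edges E_ij (T_i < T_j, with T_i uncensored;
    # no edge originates from a censored point) on which the predicted
    # scores are correctly ordered, cf. Eq. 4.
    num, den = 0, 0
    n = len(times)
    for i in range(n):
        if deltas[i] == 0:            # censored: no outgoing edges
            continue
        for j in range(n):
            if times[i] < times[j]:   # edge E_ij: i failed before j
                den += 1
                if f_values[i] < f_values[j]:
                    num += 1
    return num / den

times    = [1.0, 2.0, 3.0, 4.0]
deltas   = [1, 1, 0, 1]
f_values = [0.1, 0.4, 0.2, 0.9]      # predicted ranking scores
ci = concordance_index(f_values, times, deltas)   # 4 of 5 edges correct
```

In this example the censored third subject still appears on the right-hand side of edges (it is known to have outlived subjects 1 and 2), which is exactly how censoring enters the order graph.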
Given the data D and the order graph G, the optimal ranking function is f̂ = arg max_{f∈F} c(D, G, f). To prevent overfitting on the training data, regularization can be added to this equation, see Secs. 5 and 6. In many cases, sufficient regularization is also achieved by restricting the function class F, e.g., it may contain only linear functions. For ease of exposition we will consider the family of linear ranking functions^1 in this paper: F = {fw}, where for any x, w ∈ R^d, fw(x) = w⊤x.

4 Lower bounds on the CI

Maximizing the CI is a discrete optimization problem, which is computationally expensive. For this reason, we resort to maximizing a differentiable and concave lower bound on the 0-1 indicator function in the concordance index, cf. Eqs. 4 and 5. In this paper we focus on the log-sigmoid lower bound [12], cf. Sec. 5, and the exponential lower bound, cf. Sec. 6, which are suitably scaled so as to be tight at the origin and also in the asymptotic limit of large positive values, see also Fig. 1(c). We will also show how these bounds relate to the classical approaches in survival analysis: as it turns out, for the family of linear ranking functions, these two approaches are closely related to the PH model commonly used in survival analysis, cf. Sec. 5.2.

5 Log-sigmoid lower bound

The first subsection discusses the lower bound on the concordance index based on the log-sigmoid function. The second subsection shows that this bound arises naturally when using proportional hazard models.

5.1 Lower bound

The sigmoid function is defined as σ(z) = 1/(1 + e^{−z}). While it is an approximation to the indicator function, it is not a lower bound. In contrast, the scaled version of the log of the sigmoid function, log[2σ(z)]/log 2, is a lower bound on the indicator function (Fig. 
1(c)), i.e.,

1_{z>0} ≥ 1 + log σ(z)/log 2.   (6)

The log-sigmoid function is concave and asymptotically linear for large negative values, and may hence be considered a differentiable approximation to the hinge loss, which is commonly used for training support vector machines. The lower bound on the concordance index (cf. Eq. 4) follows immediately:

c = (1/|E|) Σ_{Eij} 1_{f(xj)−f(xi)>0} ≥ (1/|E|) Σ_{Eij} [1 + log σ(f(xj) − f(xi))/log 2] ≡ ĉ_LS,   (7)

which can efficiently be maximized by gradient-based methods (cf. Sec. 7). Given the linear ranking function fw(x) = w⊤x, the bound ĉ_LS becomes

ĉ_LS(w) = (1/|E|) Σ_{Eij} [1 + log σ(w⊤(xj − xi))/log 2].   (8)

To avoid overfitting, we penalize functions with a large norm w in the standard way, and obtain the regularized version

ĉ_LSreg(w) = −(λ/2) ||w||² + ĉ_LS(w).   (9)

^1 Generalization to non-linear functions can be achieved easily by using kernels: the linear ranking function class F is replaced by H, a reproducing kernel Hilbert space (RKHS). The ranking function then is of the form f(x) = Σ_{i=1}^N αi k(x, xi), where k is the kernel of the RKHS H.

5.2 Connection to the PH model

The concordance index can be interpreted as the probability of correct ranking (as defined by the given order graph) given a function f.
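Before turning to the PH connection, the scaled log-sigmoid bound of Eq. 6 can be sanity-checked numerically (this check is our own illustration, not part of the paper): the bound stays below the 0-1 indicator everywhere, is tight at the origin, and approaches 1 for large positive z.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_sigmoid_bound(z):
    # Scaled log-sigmoid of Eq. 6: 1 + log(sigmoid(z)) / log(2),
    # i.e., log(2 * sigmoid(z)) / log(2).
    return 1.0 + math.log(sigmoid(z)) / math.log(2.0)

zs = [z / 10.0 for z in range(-50, 51)]
# The bound never exceeds the indicator 1_{z > 0} ...
ok = all(log_sigmoid_bound(z) <= (1.0 if z > 0 else 0.0) for z in zs)
at_zero = log_sigmoid_bound(0.0)     # ... is tight at the origin (equals 0) ...
far_right = log_sigmoid_bound(50.0)  # ... and tends to 1 for large positive z
```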
Its probabilistic version can thus be cast as a likelihood. Under the assumption that each pair (j, i) is independent of any other pair, the log-likelihood reads

L(fw, D, G) = log Π_{Eij} Pr[fw(xi) < fw(xj) | w].   (10)

As this independence assumption obviously does not hold among all pairs due to transitivity (even though the individual samples i are assumed i.i.d.), it provides a lower bound on the concordance index.

While the probability of correct pairwise ordering, Pr[fw(xi) < fw(xj) | w], is often chosen to be sigmoid in the ranking literature [1], we show in the following that the sigmoid function arises naturally in the context of PH models. Let T(w⊤x) denote the survival time for the patient with covariates x or relative log-hazard w⊤x. A larger hazard corresponds to a smaller survival time, cf. Sec. 2. Hence

Pr[fw(xi) < fw(xj) | w] = Pr[T(w⊤xj) > T(w⊤xi) | w] = ∫_0^∞ Pr[T(w⊤xj) > t] p(t|xi) dt
                        = ∫_0^∞ S(t|xj) p(t|xi) dt = ∫_0^∞ −S(t|xj) S′(t|xi) dt,

where p(t|xi) is the density function of T for patient i with covariate xi, and S(t|xi) is the corresponding survival function; S′(t) = dS(t)/dt = −p(t). Using Eq. 2 of the PH model, we continue the manipulations:

Pr[fw(xi) < fw(xj) | w] = e^{w⊤xi} ∫_0^∞ e^{−Λ0(t)[e^{w⊤xj} + e^{w⊤xi}]} Λ0′(t) dt
                        = e^{w⊤xi} / (e^{w⊤xj} + e^{w⊤xi}) = σ[w⊤(xi − xj)].   (11)

This derivation shows that the probability of correct pairwise ordering indeed follows the sigmoid function.
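The closed form in Eq. 11 can be verified numerically for the special case of a constant baseline hazard λ0 (so that Λ0(t) = λ0 t); the numbers below are invented for illustration. A Riemann sum approximates ∫ S(t|xj) p(t|xi) dt and should match σ(w⊤(xi − xj)):

```python
import math

lam0 = 0.7
a, b = 1.2, -0.4             # relative log-hazards w'x_i and w'x_j
hi, hj = math.exp(a), math.exp(b)

# Left Riemann sum of S(t|x_j) * p(t|x_i) with exponential survival times:
# S(t|x) = exp(-lam0 * t * e^{w'x}), p(t|x) = lam0 * e^{w'x} * S(t|x).
dt, T = 1e-3, 20.0
integral, t = 0.0, 0.0
while t < T:
    S_j = math.exp(-lam0 * t * hj)
    p_i = lam0 * hi * math.exp(-lam0 * t * hi)
    integral += S_j * p_i * dt
    t += dt

# Eq. 11: the integral collapses to e^a / (e^a + e^b) = sigmoid(a - b).
closed_form = 1.0 / (1.0 + math.exp(-(a - b)))
```

Note that λ0 cancels entirely, mirroring the fact that the pairwise ordering probability in Eq. 11 does not depend on the baseline hazard.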
Assuming a prior Pr[w] = N(w | 0, λ^{−1}) for regularization, the optimal maximum a-posteriori (MAP) estimator is of the form ŵ_MAP = arg max_w L(w), where the posterior L(w) takes the form of a penalized log-likelihood:

L(w) = −(λ/2) ||w||² + Σ_{Eij} log σ[w⊤(xj − xi)].   (12)

This expression is equivalent to Eq. 9 except for a few constants that are irrelevant for the optimization problem, which justifies our choice of regularization in Eq. 9.

6 Exponential lower bound

The exponential 1 − e^{−z} can serve as an alternative lower bound on the step indicator function (see Fig. 1(c)). The concordance index can then be lower-bounded by

c ≥ (1/|E|) Σ_{Eij} [1 − e^{−[f(xj) − f(xi)]}] ≡ ĉ_E.   (13)

Analogous to the log-sigmoid bound, for the linear ranking function fw(x) = w⊤x, the lower bound ĉ_E simplifies to

ĉ_E(w) = (1/|E|) Σ_{Eij} [1 − e^{−w⊤(xj − xi)}],   (14)

and, penalizing functions with large norm w, the regularized version reads

ĉ_Ereg(w) = −(λ/2) ||w||² + (1/|E|) Σ_{Eij} [1 − e^{−w⊤(xj − xi)}].   (15)

7 Gradient-based learning

In order to maximize the regularized concave surrogate we can use any gradient-based learning technique. We use the Polak-Ribière variant of the nonlinear conjugate gradient (CG) algorithm [11]. The CG method only needs the gradient g(w) and does not require evaluation of the function. It also avoids the need for computing the second derivatives. The convergence of CG is much faster than that of steepest descent. Using the fact that dσ(z)/dz = σ(z)[1 − σ(z)] and 1 − σ(z) = σ(−z), the gradient of Eq. 
9 (log-sigmoid bound) is given by

∇w ĉ_LSreg(w) = −λw − (1/(|E| log 2)) Σ_{Eij} (xi − xj) σ[w⊤(xi − xj)],

and the gradient of Eq. 15 (exponential bound) by

∇w ĉ_Ereg(w) = −λw − (1/|E|) Σ_{Eij} (xi − xj) e^{−w⊤(xj − xi)}.

8 Is Cox's partial likelihood a lower bound on the CI?

Our experimental results (Sec. 9) indicate that Cox's method and our proposed methods show similar performance when assessed using the CI. While our proposed methods were formulated to explicitly maximize a lower bound on the concordance index, Cox's method maximizes the partial likelihood. One may therefore wonder whether Cox's partial likelihood itself is a lower bound on the concordance index. The argument presented below gives an indication as to why a method which maximizes the partial likelihood also ends up (approximately) maximizing the concordance index. We re-write the exponential bound on the CI for proportional hazard models from Sec. 6 (with the sign of w flipped to match the PH convention that a larger w⊤x means a larger hazard and hence a shorter survival time):

ĉ_E(w) = (1/|E|) Σ_{Eij} [1 − e^{−w⊤(xi − xj)}] = 1 − (1/|E|) Σ_{Ti uncensored} e^{−w⊤xi} [ Σ_{Tj ≥ Ti} e^{w⊤xj} ]
       = 1 − (No/|E|) Σ_{Ti uncensored} (1/No)(1/zi),   where   zi = e^{w⊤xi} / Σ_{Tj ≥ Ti} e^{w⊤xj} ∈ [0, 1].   (16)

Note that we have replaced Tj > Ti by Tj ≥ Ti, assuming that there are no ties in the data, i.e., no two survival times are identical, analogous to Cox's partial likelihood approach (cf. Sec. 2.4). The number of uncensored observations is denoted by No. Cox's partial likelihood can be written in terms of the zi as L(w) = Π_{Ti uncensored} zi = ⟨zi⟩^{No}_geom, where ⟨zi⟩_geom denotes the geometric mean of the zi with uncensored Ti.
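On a toy example, one can check that the zi are exactly the factors of Cox's partial likelihood (Eq. 3): each zi lies in (0, 1], their product equals the partial likelihood, and zi = 1 for the largest uncensored survival time. The numbers below are invented for illustration:

```python
import math

times  = [1.0, 2.0, 3.0, 4.0]
deltas = [1, 1, 0, 1]               # third observation is censored
scores = [0.9, 0.2, -0.3, -1.1]     # relative log-hazards w'x_i

# z_i = e^{w'x_i} / sum over the risk set {j : T_j >= T_i} of e^{w'x_j},
# computed only for uncensored observations, cf. Eq. 16.
zs = []
for i, (ti, di) in enumerate(zip(times, deltas)):
    if di == 0:
        continue
    risk = sum(math.exp(scores[j]) for j, tj in enumerate(times) if tj >= ti)
    zs.append(math.exp(scores[i]) / risk)

partial_likelihood = math.prod(zs)  # equals Cox's partial likelihood, Eq. 3
```

The last factor is exactly 1 because the largest uncensored time has a risk set containing only itself, which is the observation used in the max zi = 1 argument below Eq. 17.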
Using the inequality zi ≥ min_i zi, the concordance index can be bounded as

c ≥ 1 − (No/|E|) (1/min_i zi).   (17)

This says that maximizing min_i zi maximizes a lower bound on the concordance index. While this does not say anything directly about Cox's partial likelihood, it still gives a useful insight. Since max_i zi = 1 (because zi = 1 for the largest uncensored Ti), maximizing min_i zi can be expected to approximately maximize the geometric mean of the zi, and hence Cox's partial likelihood.

Table 1: Summary of the five data sets used. N is the number of patients. d is the number of covariates used.

Dataset      N    d   Missing  Censored
MAASTRO      285  19  3.6%     30.5%
SUPPORT-1    477  26  14.9%    36.4%
SUPPORT-2    314  26  16.6%    43.0%
SUPPORT-4    149  26  22.0%    10.7%
MELANOMA     191  4   0.0%     70.2%

9 Experiments

In this section we compare the three methods, i.e., the two lower bounds on the CI (log-sigmoid and exponential) and Cox's partial likelihood, on five medical data sets.

9.1 Medical datasets

Table 1 summarizes the five data sets we used in our experiments. A substantial amount of data is censored and also missing. The MAASTRO dataset concerns the survival time of non-small cell lung cancer patients, which we analyzed as part of our collaboration. The other medical data sets are publicly available: the SUPPORT dataset^2 is a random sample from Phases I and II of the SUPPORT (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment) study [9]. As suggested in [6] we split the dataset into three different datasets, each corresponding to a different cause of death. The MELANOMA data^3 is from a clinical study of skin cancer.

9.2 Evaluation procedure

For each data set, 70% of the examples were used for training and the remaining 30% as the hold-out set for testing.
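The overall train-and-evaluate loop can be sketched as follows. This is a minimal illustration only: we use plain gradient ascent on the regularized log-sigmoid bound (Eq. 9, with the gradient from Sec. 7) in place of the paper's conjugate-gradient solver, and synthetic data in place of the medical datasets; all settings below are our own assumptions:

```python
import math, random

random.seed(1)

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic right-censored data: hazard rate e^{w_true'x}, so a larger
# w_true'x means a shorter survival time.
d, N = 2, 60
w_true = [1.0, -0.5]
xs, times, deltas = [], [], []
for _ in range(N):
    x = [random.gauss(0, 1) for _ in range(d)]
    t = random.expovariate(math.exp(dot(w_true, x)))   # latent survival time
    c = random.expovariate(0.3)                        # censoring time
    xs.append(x)
    times.append(min(t, c))
    deltas.append(1 if t <= c else 0)

# Order-graph edges E_ij: T_i < T_j with T_i uncensored.
edges = [(i, j) for i in range(N) for j in range(N)
         if deltas[i] == 1 and times[i] < times[j]]

def ci(w):
    good = sum(1 for i, j in edges if dot(w, xs[j]) - dot(w, xs[i]) > 0)
    return good / len(edges)

# Gradient ascent on c_LSreg (Eq. 9); gradient as derived in Sec. 7.
lam, step = 0.1, 0.5
w = [0.0] * d
for _ in range(200):
    g = [-lam * wk for wk in w]
    for i, j in edges:
        s = sigmoid(dot(w, [a - b for a, b in zip(xs[i], xs[j])]))
        for k in range(d):
            g[k] += s * (xs[j][k] - xs[i][k]) / (len(edges) * math.log(2))
    w = [wk + step * gk for wk, gk in zip(w, g)]

ci_learned = ci(w)   # training CI of the learned linear ranking function
```

Because the objective directly rewards correctly ordered pairs, the learned w recovers (up to sign and scale) the direction of w_true, and the training CI ends up well above chance level 0.5.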
We chose the optimal value of the regularization parameter λ (cf. Eqs. 9 and 15) based on five-fold cross-validation on the training set. The tolerance for the conjugate gradient procedure was set to 10^{−3}. The conjugate-gradient optimization procedure was initialized to the zero vector. All the covariates were normalized to have zero mean and unit variance. As missing values were not the focus of this paper, we used a simple imputation technique: for each missing value, we imputed a sample drawn from a Gaussian distribution with its mean and variance estimated from the available values of the other patients.

9.3 Results

The performance was evaluated in terms of the concordance index, and the results are tabulated in Table 2. We compare the following methods: (1) Cox's partial likelihood method, and (2) the proposed ranking methods with the log-sigmoid and exponential lower bounds. The following observations can be made: (1) The proposed linear ranking methods perform slightly better than Cox's partial likelihood method, but the difference does not appear significant. This agrees with our insight that Cox's partial likelihood may also end up maximizing the CI. (2) The exponential bound shows slightly better performance than the log-sigmoid bound, which may indicate that the tightness of the bound for positive z in Fig. 1(c) is more important than for negative z in our data sets. However, the difference is not significant.

10 Conclusions

In this paper, we outlined several approaches for maximizing the concordance index, the standard performance measure in survival analysis when cast as a ranking problem. We showed that, for the widely-used proportional hazard models, the log-sigmoid function arises as a natural lower bound on the concordance index. We presented an approach for directly optimizing this lower bound in a computationally efficient way.
This optimization procedure can also be applied to other lower bounds, like the exponential one. Apart from that, we showed that maximizing Cox's partial likelihood can be understood as (approximately) maximizing a lower bound on the concordance index, which explains the high CI scores of proportional hazard models observed in practice. Optimization of each of these three objectives results in about the same CI score in our experiments, with our new approach giving tentatively better results.

^2 http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/DataSets
^3 www.stat.uni-muenchen.de/service/datenarchiv/melanoma/melanoma_e.html

Table 2: Concordance indices for the different methods and datasets. The mean and the standard deviation are computed over a five-fold cross-validation. The results are also shown for a fixed holdout set.

Dataset    Method       CI training set  CI test set      CI holdout set
                        mean [± std]     mean [± std]
MAASTRO    Cox PH       0.65 [±0.02]     0.57 [±0.09]     0.64
           log-sigmoid  0.69 [±0.02]     0.60 [±0.06]     0.64
           exponential  0.69 [±0.02]     0.64 [±0.08]     0.65
SUPPORT-1  Cox PH       0.76 [±0.01]     0.74 [±0.05]     0.79
           log-sigmoid  0.83 [±0.01]     0.77 [±0.04]     0.79
           exponential  0.83 [±0.01]     0.79 [±0.02]     0.82
SUPPORT-2  Cox PH       0.70 [±0.02]     0.63 [±0.06]     0.69
           log-sigmoid  0.79 [±0.01]     0.68 [±0.06]     0.65
           exponential  0.78 [±0.02]     0.68 [±0.09]     0.70
SUPPORT-4  Cox PH       0.78 [±0.01]     0.68 [±0.09]     0.64
           log-sigmoid  0.80 [±0.01]     0.74 [±0.12]     0.71
           exponential  0.79 [±0.01]     0.73 [±0.03]     0.71
MELANOMA   Cox PH       0.63 [±0.03]     0.62 [±0.09]     0.54
           log-sigmoid  0.76 [±0.02]     0.70 [±0.10]     0.55
           exponential  0.76 [±0.01]     0.65 [±0.11]     0.55

Acknowledgements

We are grateful to R. 
Bharat Rao for encouragement and support of this work, and to the anonymous reviewers for their valuable comments.

References

[1] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[2] D. R. Cox. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B, 34(2):187–220, 1972.
[3] D. R. Cox. Partial likelihood. Biometrika, 62(2):269–276, 1975.
[4] D. R. Cox and D. Oakes. Analysis of Survival Data. Chapman and Hall, 1984.
[5] Y. Freund, R. Iyer, and R. Schapire. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
[6] F. E. Harrell Jr. Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer, 2001.
[7] R. Herbrich, T. Graepel, P. Bollmann-Sdorra, and K. Obermayer. Learning preference relations for information retrieval. In ICML-98 Workshop: Text Categorization and Machine Learning, pages 80–84, 1998.
[8] J. D. Kalbfleisch and R. L. Prentice. The Statistical Analysis of Failure Time Data. Wiley-Interscience, 2002.
[9] W. A. Knaus, F. E. Harrell, J. Lynn, et al. The SUPPORT prognostic model: objective estimates of survival for seriously ill hospitalized adults. Annals of Internal Medicine, 122:191–203, 1995.
[10] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50–60, 1947.
[11] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
[12] V. C. Raykar, R. Duraiswami, and B. Krishnapuram. A fast algorithm for learning large scale preference relations. In M. Meila and X. 
Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, pages 385–392, 2007.
[13] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, December 1945.
", "award": [], "sourceid": 535, "authors": [{"given_name": "Harald", "family_name": "Steck", "institution": null}, {"given_name": "Balaji", "family_name": "Krishnapuram", "institution": null}, {"given_name": "Cary", "family_name": "Dehing-oberije", "institution": null}, {"given_name": "Philippe", "family_name": "Lambin", "institution": null}, {"given_name": "Vikas", "family_name": "Raykar", "institution": null}]}