{"title": "Ordered Classes and Incomplete Examples in Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 550, "page_last": 556, "abstract": null, "full_text": "Ordered Classes and Incomplete Examples in Classification\n\nMark Mathieson\n\nDepartment of Statistics, University of Oxford\n1 South Parks Road, Oxford OX1 3TG, UK\n\nE-mail: mathies@stats.ox.ac.uk\n\nAbstract\n\nThe classes in classification tasks often have a natural ordering, and the training and testing examples are often incomplete. We propose a non-linear ordinal model for classification into ordered classes. Predictive, simulation-based approaches are used to learn from past and classify future incomplete examples. These techniques are illustrated by making prognoses for patients who have suffered severe head injuries.\n\n1 Motivation\n\nJennett et al. (1979) reported data on patients with severe head injuries. For each patient some of the information in Table 1 was available shortly after injury. The objective is to predict the degree of recovery attained within six months as measured by outcome. This problem exhibits two characteristics that are common in classification tasks: allocation of examples into classes which have a natural ordering, and learning from past and classifying future incomplete examples.\n\n2 A Flexible Model for Ordered Classes\n\nThe Bayes decision rule (see, for example, Ripley, 1996) depends on the loss $L(j, k)$ incurred in assigning to class k an object belonging to class j. When better information is unavailable, for unordered or nominal classes we treat every misclassification as equally serious: $L(j, k)$ is 0 when $j = k$ and 1 otherwise. For ordered classes, when the K classes are numbered from 1 to K in their natural order, a better default choice is $L(j, k) = |j - k|$. 
\nA class is then given support by its position in the ordering, and the Bayes rule will sometimes assign patterns to classes that do not have maximum posterior probability to avoid making a serious error.\n\nTable 1: Definition of variables with proportion missing.\n\nVariable | Definition | Missing %\nage | Age in decades (1=0-9, 2=10-19, ..., 8=70+). | 0\nemv | Measure of eye, motor and verbal response to stimulation (1-7). | 41\nmotor | Motor response patterns for all limbs (1-7). | 33\nchange | Change in neurological function over the first 24 hours (-1, 0, +1). | 78\neye | Eye indicant. 1 (bad), 2 (impaired), 3 (good). | 65\npupils | Pupil reaction to light. 1 (non-reacting), 2 (reacting). | 30\noutcome | Recovery after six months based on Glasgow Outcome Scale. 1 (dead/vegetative), 2 (severe disability), 3 (moderate/good recovery). | 0\n\nIf the classes in a classification problem are ordered, the ordering should also be reflected in the probability model. Methods for nominal tasks can certainly be used for ordinal problems, but an ordinal model should have a simpler parameterization than comparable nominal models, and interpretation will be easier. Suppose that an example represented by a row vector X belongs to class $C = C(X)$. To make the Bayes-optimal classification it is sufficient to know the posterior probabilities $p(C = k \\mid X = x)$. The ordinal logistic regression (OLR) model for K ordered classes models the cumulative posterior class probabilities $p(C \\le k \\mid X = x)$ by\n\n$$\\log\\left[\\frac{p(C \\le k \\mid X = x)}{1 - p(C \\le k \\mid X = x)}\\right] = \\phi_k - \\eta(x), \\qquad k = 1, \\dots, K-1, \\tag{1}$$\n\nfor some function $\\eta$. We impose the constraints $\\phi_1 \\le \\dots \\le \\phi_{K-1}$ on the cut-points to ensure that $p(C \\le k \\mid X = x)$ increases with k. 
If $\\phi_0 = -\\infty$ and $\\phi_K = \\infty$ then (1) gives\n\n$$p(C = k \\mid X = x) = \\sigma(\\phi_k - \\eta(x)) - \\sigma(\\phi_{k-1} - \\eta(x)), \\qquad k = 1, \\dots, K,$$\n\nwhere $\\sigma(x) = 1/(1 + e^{-x})$. McCullagh (1980) proposed linear OLR where $\\eta(x) = x\\beta$. The posterior probabilities depend on the patterns x only through $\\eta$, and high values of $\\eta(x)$ correspond to higher predicted classes (Figure 1a). This can be useful for interpreting the fitted model. However, linear OLR is rather inflexible since the decision boundaries are always parallel hyperplanes. Departures from linearity can be accommodated by allowing $\\eta$ to be a non-linear function of the feature space. We extend OLR to non-linear ordinal logistic regression (NOLR) by letting $\\eta(x)$ be the single linear output of a feed-forward neural network with input vector x, having skip-layer connections and sigmoid transfer functions in the hidden layer (Figure 1b). Then for weights $w_{ij}$ and biases $b_j$ we have\n\n$$\\eta(x) = \\sum_{i \\to o} w_{io} x^{(i)} + \\sum_{j \\to o} w_{jo}\\, \\sigma\\Big(b_j + \\sum_{i \\to j} w_{ij} x^{(i)}\\Big),$$\n\nwhere $\\sum_{i \\to j}$ denotes the sum over i such that node i is connected to node j, and node o is the single output node. The usual output-unit bias is incorporated in the cut-points. Observe that OLR is the special case of NOLR with no hidden nodes. Although the network component of NOLR is a universal approximator, the NOLR model cannot approximate all probability densities arbitrarily well (unlike 'softmax', the most similar nominal method). The likelihood for the cut-points $\\phi = (\\phi_1, \\dots, \\phi_{K-1})$ and network parameters w given a training set $T = \\{(x_i, c_i) \\mid i = 1, \\dots, n\\}$ of n correctly classified examples is\n\n$$\\ell(w, \\phi) = \\prod_{i=1}^{n} p(c_i \\mid x_i) = \\prod_{i=1}^{n} \\big[\\sigma(\\phi_{c_i} - \\eta(x_i; w)) - \\sigma(\\phi_{c_i - 1} - \\eta(x_i; w))\\big]. \\tag{2}$$\n\n\f552 \n\nM. 
Mathieson \n\n[Figure 1. Panel (a): curves p(1 | eta) to p(5 | eta) against network output (eta). Panel (b): horizontal axis age (years).]\n\nFigure 1: (a) $p(k \\mid \\eta)$ plotted against $\\eta$ for an OLR model with K = 5 classes and $\\phi = (-7, -6, -3, -1)$. (b) The network output $\\eta(x)$ from a NOLR model used to predict change given all other variables (except outcome) predicts that young patients with high emv score are likely to improve over the first 24 hours. While age and emv are varied, other variables are fixed. Dark shading denotes low values of $\\eta(x)$. The Bayes decision boundaries are shown for loss $L(j, k) = |j - k|$.\n\nIf we estimate the classifier by substituting the maximum likelihood estimates we must maximize (2) whilst constraining the cut-points to be increasing (Mathieson, 1996). To avoid over-fitting we regularize both by weight decay (which is equivalent to putting independent Gaussian priors on the network weights) and by imposing independent Gamma priors on the differences between adjacent cut-points. The minimand is now $-\\log \\ell(w, \\phi) + \\lambda D(w) + E(\\phi; t, \\alpha)$ with hyperparameters $\\lambda > 0$, $t$, $\\alpha$ (to be chosen by cross-validation, for example, or averaged over under a Bayesian scheme) where $D(w) = \\sum_{i,j} w_{ij}^2$ and\n\n$$E(\\phi) = \\sum^{K-1} \\big[\\, t(\\, \\dots$$\n\n[...]\n\nUnconditional imputation (sample missing values from the empirical distribution of each variable in the training set).\n\nGibbs sampling from $p(X^u \\mid x^o, \\hat{\\psi})$.\n\nPool the 1000 completions from the line above to form a single training set.\n\nTest set loss: 132, 149, 133, 118, 117\n\n[...] (Ripley, 1994) approximates each posterior by a mixture of Gaussians centred at the local maxima $\\theta_{j1}$, ... 
, $\\theta_{jR_j}$ of $p(\\theta \\mid T, X_j^u)$ to give\n\n(7)\n\nwhere: $N(\\cdot; \\mu, \\Sigma)$ is the Gaussian density function with mean $\\mu$ and covariance matrix $\\Sigma$, the Hessian $H_{jr} = \\frac{\\partial^2}{\\partial\\theta\\,\\partial\\theta^T} \\log p(\\theta \\mid T, X_j^u)$ is evaluated at $\\theta_{jr}$ and, using Laplace's approximation, $w_{jr} = p(\\theta_{jr} \\mid T, X_j^u)\\, |H_{jr}|^{-1/2}$. We can average over the maxima to get $p(c \\mid x) \\approx (m \\sum_{j,r} w_{jr})^{-1} \\sum_{j,r} p(c \\mid x; \\theta_{jr})$, but the full-blooded approach samples from the 'mixture of mixtures' approximation to $p(\\theta \\mid T)$ and also uses importance sampling to compute the predictive estimates p.\n\n3.2 The Imputation Model\n\nWe need samples from $p(x_i^u \\mid x_i^o, c_i)$ for each i. When many patterns of missing values occur it is not practical to model $p(X^u \\mid x^o, c)$ for each pattern, but Markov chain Monte Carlo methods can be employed. The Gibbs sampler is convenient and in its most basic form requires models for the distribution of each element of x given the others, that is $p(x^{(j)} \\mid x^{(-j)}, c)$ where $x^{(-j)} = (x^{(1)}, \\dots, x^{(j-1)}, x^{(j+1)}, \\dots, x^{(p)})$. We model these full conditionals parametrically as $p(x^{(j)} \\mid x^{(-j)}, c; \\psi)$ and assume here that the parameters for each of the full conditionals are disjoint, so that the jth full conditional is $p(x^{(j)} \\mid x^{(-j)}, c; \\psi^{(j)})$ where $\\psi = (\\psi^{(1)}, \\dots, \\psi^{(p)})$. When $x^{(j)}$ takes discrete values this is a classification task, and for continuous values a regression problem. Under certain conditions the chain of dependent samples of $X^u$ converges in distribution to $p(X^u \\mid x^o, \\psi)$ and the ergodic average of $p(c \\mid x^o, X^u)$ converges as required to the predictive estimate $p(c \\mid x^o)$. We usually take every $w$th sample to provide a cover of the space in fewer samples, reducing the computation required to learn the classifier. It is essential to check convergence of the Gibbs sampler, although we do not give details here. 
\nIf we have sufficient complete examples we might use them to estimate $\\psi$ to be $\\hat{\\psi}$ and Gibbs sample from $p(X^u \\mid x^o; \\hat{\\psi})$. Otherwise, in the Bayesian framework, incorporate $\\psi$ into the sampling scheme by Gibbs sampling from $p(\\psi, X^u \\mid x^o)$ (the solution suggested by Li, 1988). In the head injury example we report results using the former approach. (The latter was found to make little improvement and requires considerably more computation time.)\n\nTable 3: Predictive approximations for a NOLR model fitted to a single completion $T, X^u$ of the training set. The likelihood maxima at $\\theta_1$ and $\\theta_2$ account for over 0.99 of the posterior probability.\n\n | $\\theta_1$ | $\\theta_2$ | Predictive\nPosterior probability | 0.929 | 0.071 |\n$-\\log p(\\theta_i \\mid T, X^u)$ | 176.10 | 174.65 |\nTest set loss, using the plug-in classifier $p(c \\mid x; \\theta_i)$ | 128 | 149 | 126\nTest set loss, averaging over 10,000 samples from Gaussian | 120 | 137 | 119\n\n3.3 Classifying Incomplete Examples\n\nWe could build a separate classifier for each pattern of missing data that occurs, but this can be computationally expensive, will lose information, and the classifiers need not make consistent predictions. We know that $p(c \\mid x^o) = \\mathbb{E}_{X^u \\mid x^o}\\, p(c \\mid x^o, X^u)$, so it seems better to classify $x^o$ by averaging over repeated imputations of $X^u$ from the imputation model.\n\n4 Prognosis After Head Injury\n\nWe now return to the head injury prognosis example to learn a NOLR classifier from a training set containing 40 complete and 206 incomplete examples. The NOLR architecture (4 nodes, skip-layer connections and $\\lambda = 0.01$) was selected by cross-validation on a single imputation of the training set, and we use a predictive approximation (footnote 1). Table 2 shows the performance of this classifier on a test set of 301 complete examples under loss $L(j, k) = |j - k|$ for different strategies for dealing with the missing values. 
For imputation by Gibbs sampling we modelled each of the full conditionals using NOLR because all variables in this dataset are ordinal. Categorical inputs to models are put in as level indicators, so change corresponds to two indicators taking values (0,0), (1,0) and (1,1). Throughout this example we predict age, emv and motor as categorical variables but treat them as continuous inputs to models. Models were selected by cross-validation based on the complete training examples only and used the predictive approximation described above. Several full conditionals benefited from a non-linear model.\n\nWe now classify 199 incomplete test examples using the classifier found in the last line of Table 2. Median imputation of missing values in the test set incurs loss 132 whereas unconditional imputation incurs loss 106. The Gibbs sampling imputation model incurs loss 91 and is predicting probabilities accurately (Figure 2). Michie et al. (1994) and references therein give alternative analyses of the head injury data.\n\nNOLR has provided an interpretable network model for ordered classes, the missing data strategy successfully learns from incomplete training examples and classifies incomplete future examples, and the predictive approach is beneficial.\n\nFootnote 1: For each completion $T, X_j^u$ of the training set we form a mixture approximation (7) to $p(\\theta \\mid T, X_j^u)$, sample from this 10,000 times and average the predicted probabilities. These predictions are averaged over completions. Maxima were found by running the optimizer 50 times from randomized starting weights. Up to 26 distinct maxima were found, and in most cases approximately 5 of them accounted for over 95% of the posterior probability. Table 3 gives an example: averaging over maxima has greater effect than sampling around them, although both are useful. 
The cut-points $\\phi$ in the NOLR model must satisfy order constraints, so we rejected samples of $\\theta$ where these did not hold. However, the parameters were sufficiently well determined that this occurred in less than 0.5% of samples.\n\n[Figure 2. Panel labels: severe disability, good recovery.]\n\nFigure 2: (a) Test set calibration for median imputation (dashed) and conditional imputation (solid). For predictions by conditional imputation we average $p(c \\mid x^o, X^u)$ over 100 pseudo-independent samples from $p(X^u \\mid x^o)$. Ticks on the lower (upper) axis denote predicted probabilities for the test examples using median (conditional) imputation. (b) In 100 pseudo-independent conditional imputations of the missing parts $X^u$ of a particular incomplete test example, eight distinct values $x_i^u$ ($i = 1, \\dots, 8$) occur. (Recall that all components of x are discrete.) For each distinct imputation we plot a circle with centre corresponding to $(p(1 \\mid x^o, x_i^u), p(2 \\mid x^o, x_i^u), p(3 \\mid x^o, x_i^u))$ and area proportional to the number of occurrences of $x_i^u$ in the 100 imputations. The prediction by median imputation is located by ×; the average prediction over conditional imputations is located by \u2022. Actual outcome is 'good recovery'. The conditional method correctly classifies the example and shows that the example is close to the Bayes decision boundary under loss $L(j, k) = |j - k|$ (dashed). Median imputation results in a confident and incorrect classification.\n\nSoftware: A software library for fitting NOLR models in S-Plus is available at URL http://www.stats.ox.ac.uk/-mathies\n\nAcknowledgements: The author thanks Brian Ripley for productive discussions of this work and Gordon Murray for permission to use the head injury dataset. This research was funded by the UK EPSRC and investment managers GMO Woolley Ltd.\n\nReferences\n\nJennett, B., Teasdale, G., Braakman, R., Minderhoud, J., Heiden, J. & Kurze, T. 
(1979) Prognosis of patients with severe head injury. Neurosurgery, 4 782-790.\n\nLi, K.-H. (1988) Imputation using Markov chains. Journal of Statistical Computation and Simulation, 30 57-79.\n\nLittle, R. & Rubin, D. B. (1987) Statistical Analysis with Missing Data. (Wiley, New York).\n\nMathieson, M. J. (1996) Ordinal models for neural networks. In Neural Networks in Financial Engineering, eds A.-P. Refenes, Y. Abu-Mostafa, J. Moody and A. S. Weigend (World Scientific, Singapore) 523-536.\n\nMcCullagh, P. (1980) Regression models for ordinal data. Journal of the Royal Statistical Society Series B, 42 109-142.\n\nMichie, D., Spiegelhalter, D. J. & Taylor, C. C. (eds) (1994) Machine Learning, Neural and Statistical Classification. (Ellis Horwood, New York).\n\nRipley, B. D. (1994) Flexible non-linear approaches to classification. In From Statistics to Neural Networks. Theory and Pattern Recognition Applications, eds V. Cherkassky, J. H. Friedman and H. Wechsler (Springer Verlag, New York) 108-126.\n\nRipley, B. D. (1996) Pattern Recognition and Neural Networks. (Cambridge University Press, Cambridge).\n\n", "award": [], "sourceid": 1241, "authors": [{"given_name": "Mark", "family_name": "Mathieson", "institution": null}]}