{"title": "A Second Order Cone programming Formulation for Classifying Missing Data", "book": "Advances in Neural Information Processing Systems", "page_first": 153, "page_last": 160, "abstract": null, "full_text": " A Second order Cone Programming\n Formulation for Classifying Missing Data\n\n\n\n Chiranjib Bhattacharyya\n Department of Computer Science and Automation\n Indian Institute of Science\n Bangalore, 560 012, India\n chiru@csa.iisc.ernet.in\n\n\n Pannagadatta K. S. Alexander J. Smola\n Department of Electrical Engineering Machine Learning Program\n Indian Institute of Science National ICT Australia and ANU\n Bangalore, 560 012, India Canberra, ACT 0200, Australia\n pannaga@ee.iisc.ernet.in Alex.Smola@anu.edu.au\n\n\n\n\n Abstract\n\n We propose a convex optimization based strategy to deal with uncertainty\n in the observations of a classification problem. We assume that instead\n of a sample (xi, yi) a distribution over (xi, yi) is specified. In particu-\n lar, we derive a robust formulation when the distribution is given by a\n normal distribution. It leads to Second Order Cone Programming formu-\n lation. Our method is applied to the problem of missing data, where it\n outperforms direct imputation.\n\n\n\n1 Introduction\n\nDenote by (x, y) X Y patterns with corresponding labels. The typical machine learning\nformulation only deals with the case where (x, y) are given exactly. Quite often, however,\nthis is not the case -- for instance in the case of missing values we may be able (using a\nsecondary estimation procedure) to estimate the values of the missing variables, albeit with\na certain degree of uncertainty. It is therefore only natural to take the decreased reliability of\nsuch data into account and design estimators accordingly. What we propose in the present\npaper goes beyond the traditional imputation strategy where missing values are estimated\nand then used as if they had actually been observed. 
The key difference in what follows is
that we will require that with high probability any (x̃_i, y_i) pair, where x̃_i is drawn from a
distribution of possible x_i, will be classified correctly. For the sake of simplicity we limit
ourselves to the case of binary classification.

The paper is organized as follows: Section 2 introduces the problem of classification with
uncertain data. We solve the equations arising in the context of normal random variables
in Section 3, which leads to a Second Order Cone Program (SOCP). As an application,
the problem of classification with missing variables is described in Section 4. We report
experimental results in Section 5.

2 Linear Classification using Convex Optimization

Assume we have m observations (x_i, y_i) drawn iid (independently and identically dis-
tributed) from a distribution over X × Y, where X is the set of patterns and Y = {±1} are
the labels (e.g. the absence/presence of a particular object). It is our goal to find a function
f : X → Y which classifies observations x into classes +1 and -1.

2.1 Classification with Certainty

Assume that X is a dot product space and f is a linear function

 f(x) = sgn(⟨w, x⟩ + b).                                                        (1)

In the case of linearly separable datasets we can find (w, b) which separates the two classes.
Unfortunately, such separation is not always possible and we need to allow for slack in the
separation of the two sets. Consider the formulation

 minimize_{w,b,ξ}  Σ_{i=1}^m ξ_i                                                (2a)
 subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 - ξ_i,  ξ_i ≥ 0,  ‖w‖ ≤ W  for all 1 ≤ i ≤ m (2b)

It is well known that this problem minimizes an upper bound on the number of errors. The
latter occur whenever ξ_i ≥ 1, where ξ_i are the slack variables. The Euclidean norm of w,
‖w‖ = √⟨w, w⟩, is upper bounded by a user defined constant W. This is equivalent
to lower bounding the margin, or the separation between the two classes. The resulting
discriminant surface is called the generalized optimal hyperplane [9]. 
The statement of (2)
is slightly nonstandard. Typically one states the SVM optimization problem as follows [3]:

 minimize_{w,b,ξ}  (1/2)‖w‖² + C Σ_{i=1}^m ξ_i                                  (3a)
 subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 - ξ_i,  ξ_i ≥ 0  for all 1 ≤ i ≤ m           (3b)

Instead of the user defined parameter W, the formulation (3) uses another parameter C.
For a proper choice of C and W the two formulations are equivalent. For the purpose of
the present paper, however, (2) will be much more easily amenable to modifications and to
casting the resulting problem as a second order cone program (SOCP).

2.2 Classification with Uncertainty

So far we assumed that the (x_i, y_i) pairs are known with certainty. We now relax this to the
assumption that we only have a distribution over the x_i, that is, (P_i, y_i) at our disposition
(due to a sampling procedure, missing variables, etc.). Formally x_i ∼ P_i. In this case it
makes sense to replace the constraints (2b) of the optimization problem (2) by

 subject to  Pr{y_i(⟨w, x_i⟩ + b) ≥ 1 - ξ_i} ≥ η_i,  ξ_i ≥ 0,  ‖w‖ ≤ W,  1 ≤ i ≤ m   (4)

Here we replaced the linear classification constraint by a probabilistic one, which is re-
quired to hold with probability η_i ∈ (0, 1]. This means that by choosing a value of η_i close
to 1 we can find a conservative classifier which will classify even very infrequent (x_i, y_i)
pairs correctly. Hence η_i provides robustness of the estimate with respect to deviating x_i.

It is clear that unless we impose further restrictions on P_i, it will be difficult to minimize the
objective Σ_{i=1}^m ξ_i with the constraints (4) efficiently. In the following we will consider
the special case of Gaussian uncertainty, for which a mathematical programming formulation
can be found.

3 Normal Distributions

For the purpose of this section we assume that P_i = N(x̄_i, Σ_i), i.e., x_i is drawn from a
Gaussian distribution with mean x̄_i and covariance Σ_i. We will not require that Σ_i have full
rank. 
This means that the uncertainty about x_i may be limited to individual coordinates or
to a subspace of X. As we shall see, this problem can be posed as an SOCP.

3.1 Robust Classification

Under the above assumptions, the probabilistic constraint (4) becomes

 subject to  Pr{y_i(⟨w, x_i⟩ + b) ≥ 1 - ξ_i} ≥ η_i  where x_i ∼ N(x̄_i, Σ_i)    (5a)
             ξ_i ≥ 0,  ‖w‖ ≤ W  for all 1 ≤ i ≤ m                               (5b)

The stochastic constraint can be restated as a deterministic one,

 Pr{ (z_i - z̄_i)/σ_{z_i} ≤ (y_i b + ξ_i - 1 - z̄_i)/σ_{z_i} } ≥ η_i            (6)

where z_i := -y_i⟨w, x_i⟩ is a normal random variable with mean z̄_i = -y_i⟨w, x̄_i⟩ and
variance σ²_{z_i} := wᵀΣ_i w. Consequently (z_i - z̄_i)/σ_{z_i} is a random variable with
zero mean and unit variance and we can compute the lhs of (6) by evaluating the cumulative
distribution function for normal distributions

 Φ(u) := (1/√(2π)) ∫_{-∞}^{u} e^{-s²/2} ds.

In summary, (6) is equivalent to the condition

 Φ( (y_i b + ξ_i - 1 - z̄_i)/σ_{z_i} ) ≥ η_i,

which, since Φ(u) is monotonic and invertible, can be solved for the argument of Φ to
obtain the condition

 y_i(⟨w, x̄_i⟩ + b) ≥ 1 - ξ_i + κ_i √(wᵀΣ_i w),  where κ_i = Φ^{-1}(η_i).       (7)

We now proceed to deriving a mathematical programming formulation.

3.2 Second Order Cone Programming Formulation

Depending on κ_i we can distinguish between three different cases. First consider the case
where κ_i = 0, i.e. η_i = 0.5. This means that the second order cone part of the constraint (7)
reduces to the linear inequality of (2b). In other words, we recover the linear constraint of
a standard SVM.

Secondly consider the case κ_i < 0, i.e. η_i < 0.5. This means that the constraint (7) describes
a concave set, which turns the linear classification task into a hard optimization problem.
However, it is not very likely that anyone would like to impose such constraints which hold
only with low probability. 
After all, uncertain data requires the constraint to become more
restrictive, holding not only for a single guaranteed point x_i but for an entire set.

Lastly consider the case κ_i > 0, i.e. η_i > 0.5, which yields a genuine second order cone
constraint. In this case (7) describes a convex set in (w, b, ξ_i). We obtain the following
optimization problem:

 minimize_{w,b,ξ}  Σ_{i=1}^m ξ_i                                                (8a)
 subject to  y_i(⟨w, x̄_i⟩ + b) ≥ 1 - ξ_i + κ_i ‖Σ_i^{1/2} w‖  and  ξ_i ≥ 0,  1 ≤ i ≤ m   (8b)
             ‖w‖ ≤ W                                                            (8c)

These problems can be solved efficiently by publicly available codes: recent advances in
interior point methods for convex nonlinear optimization [8] have made such problems
feasible. As a special case of convex nonlinear optimization, SOCPs have gained much
attention in recent times. For a further discussion of efficient algorithms and applications
of SOCPs see [6].

3.3 Worst Case Prediction

Note that if at optimality ξ_i > 0, the hyperplane intersects with the constraint set
B(x̄_i, Σ_i, η_i). Moreover, at a later stage we will need to predict the class label, i.e. to
assess on which side of the hyperplane B lies. If the hyperplane intersects B we will end up
with different predictions for points in the different half spaces. In such a scenario a worst
case prediction ŷ can be made via

 ŷ = sgn(z) sgn(h - κ)  where κ = Φ^{-1}(η),  z = (⟨w, x̄⟩ + b)/√(wᵀΣw)  and  h = |z|.   (9)

Here sgn(z) gives us the sign of the center of the ellipsoid, and h is the distance of the
hyperplane from the center in the metric induced by Σ, so the hyperplane intersects the
ellipsoid exactly when h < κ. If the hyperplane intersects the ellipsoid, the worst case
prediction is then the prediction for all points which are in the opposite half space of the
center x̄_i. 
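The worst case rule (9) transcribes almost directly into code; a small sketch with numpy/scipy, where the particular values of w, b, x̄, Σ and η are made-up examples:

```python
import numpy as np
from scipy.stats import norm

def worst_case_predict(w, b, x_bar, Sigma, eta):
    """Worst case label over the ellipsoid B(x_bar, Sigma, eta), rule (9)."""
    kappa = norm.ppf(eta)                          # kappa = Phi^{-1}(eta)
    z = (w @ x_bar + b) / np.sqrt(w @ Sigma @ w)   # signed distance in the Sigma metric
    h = abs(z)
    # sgn(z): side of the ellipsoid's center; sgn(h - kappa): flips the label
    # whenever the hyperplane cuts through the ellipsoid (h < kappa).
    return np.sign(z) * np.sign(h - kappa)

w, b = np.array([1.0, 0.0]), 0.0
Sigma = np.eye(2)
worst_case_predict(w, b, np.array([2.0, 0.0]), Sigma, 0.9)  # ellipsoid cleared: +1
worst_case_predict(w, b, np.array([1.0, 0.0]), Sigma, 0.9)  # hyperplane cuts it: -1
```

In the second call z = 1 falls below κ = Φ^{-1}(0.9) ≈ 1.28, so the worst case label is the opposite of the center's side, exactly as described above.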
Plugging η = 0.5, i.e., κ = 0, into (9) yields the standard prediction (1).
In this case h can serve as a measure of confidence as to how well the discriminating
hyperplane classifies the mean x̄_i correctly.

3.4 Set Constraints

The same problem as (8) can also be obtained by considering that the uncertainty in each
datapoint is characterized by an ellipsoid

 B(x̄_i, Σ_i, η_i) = {x : (x - x̄_i)ᵀ Σ_i^{-1} (x - x̄_i) ≤ κ_i²}                (10)

in conjunction with the constraint

 y_i(⟨w, x⟩ + b) ≥ 1 - ξ_i  for all x ∈ S_i                                     (11)

where S_i = B(x̄_i, Σ_i, η_i). As before κ_i = Φ^{-1}(η_i) for κ_i ≥ 0. In other words, we
have ξ_i = 0 only when the hyperplane ⟨w, x⟩ + b = 0 does not intersect the ball B(x̄_i, Σ_i, η_i).

Note that this puts our optimization setting into the same category as the knowledge-based
SVM and SDP for invariances, as all three deal with the above type of constraint (11).
More to the point, in [5] S_i = S(x̄_i, θ) is a polynomial in θ which describes the set
of invariance transforms of x_i (such as distortion or translation). [4] define S_i to be a
polyhedral "knowledge" set, specified by the intersection of linear constraints.

Such considerations suggest yet another optimization setting: instead of specifying a poly-
hedral set S_i by constraints we can also specify it by its vertices. In particular, we may set
S_i to be the convex hull of a set, as in S_i = co{x_ij for 1 ≤ j ≤ m_i}. By the convexity of
the constraint set itself it follows that a necessary and sufficient condition for (11) to hold
is that the inequality holds for all x ∈ {x_ij for 1 ≤ j ≤ m_i}. Consequently we can replace
(11) by y_i(⟨w, x_ij⟩ + b) ≥ 1 - ξ_i. Note that the index ranges over j rather than i. Such a
setting allows us to deal with uncertainties, e.g. regarding the range of variables, which are
just given by interval boundaries, etc. 
The table below summarizes the five cases:

 Name                       Set S_i                     Optimization Problem
 Plain SVM [3]              {x_i}                       Quadratic Program
 Knowledge Based SVM [4]    polyhedral set              Quadratic Program
 Invariances [5]            trajectory of polynomial    Semidefinite Program
 Normal Distribution        B(x̄_i, Σ_i, η_i)            Second Order Cone Program
 Convex Hull                co{x_ij : 1 ≤ j ≤ m_i}      Quadratic Program

Clearly all the above constraints can be mixed and matched and it is likely that there will be
more additions to this table in the future. More central is the notion of stating the problems
via (11) as a starting point.

4 Missing Variables

In this section we discuss how to address the missing value problem. Key is how to obtain
estimates of the uncertainty in the missing variables. Since our optimization setting allows
for uncertainty in terms of a normal distribution we attempt to estimate the latter directly.
In other words, we assume that x|y is jointly normal with mean μ_y and covariance Σ_y.
Hence we have the following two-stage procedure to deal with missing variables:

 - Estimate μ_y, Σ_y from incomplete data, e.g. by means of the EM algorithm.
 - Use the conditionally normal estimates of x_missing|(x_observed, y) in the optimiza-
   tion problem. This can then be cast in terms of an SOCP as described in the previous
   section.

Note that there is nothing to prevent us from using other estimates of uncertainty and using,
e.g., the polyhedral constraints subsequently. However, for the sake of simplicity we focus
on normal distributions in this paper.

4.1 Estimation of the model parameters

We now detail the computation of the mean and covariance matrices for the datapoints
which have missing values. We just sketch the results; for a detailed derivation see e.g. [7].

Let x ∈ R^d, where x_a ∈ R^{d_a} is the vector whose values are known, while x_m ∈ R^{d-d_a}
is the vector consisting of missing variables. 
Assuming a jointly normal distribution in x
with mean μ and covariance Σ it follows that

 x_m | x_a ∼ N( μ_m + Σ_amᵀ Σ_aa^{-1} (x_a - μ_a),  Σ_mm - Σ_amᵀ Σ_aa^{-1} Σ_am ).   (12)

Here we decomposed μ, Σ according to (x_a, x_m) into

 μ = (μ_a, μ_m)  and  Σ = [ Σ_aa   Σ_am
                            Σ_amᵀ  Σ_mm ].                                      (13)

Hence, knowing μ, Σ, we can estimate the missing variables and determine their degree of
uncertainty. One can show [7] that to obtain μ, Σ the EM algorithm reads as follows:

 1. Initialize μ, Σ.
 2. Estimate x_m|x_a for all observations using (12).
 3. Recompute μ, Σ using the completed data set and go to step 2.

4.2 Robust formulation for missing values

As stated above, we model the missing variables as Gaussian random variables, with mean
and covariance given by the model described in the previous section. The standard
practice for imputation is to discard the covariance and treat the problem as a deterministic
one, using the mean as a surrogate. But using the robust formulation (8) one can as well
account for the covariance.

Let m_a be the number of datapoints for which all the values are available, while m_m is the
number of datapoints containing missing values. Then the final optimization problem reads
as follows:

 minimize_{w,b,ξ}  Σ_{i=1}^m ξ_i                                                (14)
 subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 - ξ_i,                               1 ≤ i ≤ m_a
             y_j(⟨w, x̄_j⟩ + b) ≥ 1 - ξ_j + Φ^{-1}(η_j) ‖Σ_j^{1/2} w‖,  m_a + 1 ≤ j ≤ m_a + m_m
             ξ_i ≥ 0,                                                   1 ≤ i ≤ m_a + m_m
             ‖w‖ ≤ W

The mean x̄_j has two components: x_aj contains the available values, while the imputed
part x̂_mj is given via (12). The matrix Σ_j has all entries zero except those involving the
missing values, which are given by C_j, computed via (12).

The formulation (14) is an optimization problem which involves minimizing a linear ob-
jective over linear and second order cone constraints. At optimality the values of w, b, ξ
can be used to define a classifier (1). 
The resulting discriminator can be used to predict
the class label of a test datapoint having missing variables by a process of conditional
imputation as follows.

Perform the imputation process assuming that the datapoint comes from class 1 (the class
with label y = 1). Specifically, compute the mean and covariance as outlined in Section 4.1
and denote them by μ^1 and Σ^1 (see (13)) respectively. The training dataset of class 1 is to
be used in the computation of μ^1 and Σ^1. Using the estimated μ^1 and Σ^1, compute h as
defined in (9) and denote it by h_1. Compute the label of μ^1 with the rule (1); call it y_1.

Assuming that the test datapoint comes from class 2 (with label y = -1), redo the entire
process and denote the resulting mean, covariance, and h by μ^2, Σ^2, h_2 respectively.
Denote by y_2 the label of μ^2 as predicted by (1). We decide that the observation belongs
to the class with label y given by

 y = y_2 if h_1 < h_2 and y = y_1 otherwise.                                    (15)

The above rule chooses the prediction with the higher h value; in other words the classifier
chooses the prediction about which it is more confident. Using y, h_1, h_2 as in (15), the
worst case prediction rule (9) can be modified as follows:

 ŷ = y sgn(h - κ)  where κ = Φ^{-1}(η) and h = max(h_1, h_2).                   (16)

It is our hypothesis that the formulation (14) along with this decision rule is robust to
uncertainty in the data.

5 Experiments with the Robust formulation for missing values

Experiments were conducted to evaluate the proposed formulation (14) against the stan-
dard imputation strategy. The experimental methodology consisted of creating a dataset of
missing values from a completely specified dataset. The robust formulation (14) was used
to learn a classifier on the dataset having missing values. The resulting classifier was used
to give a worst case prediction (16) on the test data. The average number of disagreements
was taken as the error measure. 
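The class-conditional imputation above boils down to the Gaussian conditioning formula (12) applied per class, followed by the confidence comparison (15). A numpy sketch of the conditioning step, with made-up toy numbers:

```python
import numpy as np

def condition(mu, Sigma, x_a, idx_a, idx_m):
    """Mean and covariance of x_m | x_a under N(mu, Sigma), formula (12)."""
    # K = Sigma_ma Sigma_aa^{-1}  (Sigma_ma = Sigma_am transposed)
    K = Sigma[np.ix_(idx_m, idx_a)] @ np.linalg.inv(Sigma[np.ix_(idx_a, idx_a)])
    mu_m = mu[idx_m] + K @ (x_a - mu[idx_a])          # conditional mean
    C = Sigma[np.ix_(idx_m, idx_m)] - K @ Sigma[np.ix_(idx_a, idx_m)]
    return mu_m, C

# Toy check: zero mean, unit variances, correlation 0.5; observe x_0 = 1.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
mu_m, C = condition(mu, Sigma, np.array([1.0]), [0], [1])
# Conditional mean is 0.5 and conditional variance 1 - 0.25 = 0.75.

# Rule (15): run condition() once with each class's (mu, Sigma), compute
# h = |<w, x_bar> + b| / sqrt(w' Sigma w| as in (9) for both, keep the label
# of the imputation with the larger h.
```

Running this per class with the class-conditional (μ^1, Σ^1) and (μ^2, Σ^2) yields the two candidate means whose confidences h_1, h_2 are compared in (15).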
In the following we describe the methodology in more detail.

Consider a fully specified dataset D = {(x_i, y_i) | x_i ∈ R^d, y_i ∈ {±1}, 1 ≤ i ≤ N} with
N observations, each observation being a d dimensional vector x_i with label y_i. A certain
fraction f of the observations was randomly chosen. For each of the chosen datapoints,
d_m = 0.5d entries were randomly deleted. This creates a dataset having N datapoints,
out of which N_m = fN (0 ≤ f ≤ 1) have missing values. This data is then
randomly partitioned into a test set and a training set in the ratio 1 : 9 respectively. We
repeat this exercise to generate 10 different datasets and all our results are averaged over
them.

Assuming that the conditional probability distribution of the missing variables given the
other variables is Gaussian, the mean x̄_j and the covariance Ĉ_j can be estimated by the
methods described in Section 4.1. The robust optimization problem was then solved for
different values of η. The parameter η_j = η is set to the same value for all the N_m
datapoints. For each value of η the worst case error is recorded.

Experimental results are reported for three public domain datasets downloaded from the
UCI repository [2]: Pima (N = 768, d = 8), Heart (N = 270, d = 13), and Ionosphere
(N = 351, d = 34).

Setting η = 0.5 yields the generalized optimal hyperplane formulation (2). The general-
ized optimal hyperplane will be referred to as the nominal classifier. 
The nominal classifier
assumes that the missing values are well approximated by the mean x̄_j and that there is no
uncertainty.

[Figure 1 here: six plots of error versus η; legends: robust, nomwc, robustwc.]

Figure 1: Performance of the robust programming solution for various datasets of the UCI
database. From left to right: Pima, Ionosphere, and Heart dataset. Top: small fraction
of data with missing variables (50%). Bottom: large number of observations with missing
variables (90%).

The experimental results are summarized in Figure 1. The robust classifier almost
always outperforms the nominal classifier in the worst case sense (compare nomwc and
robustwc). Results are presented for a low (f = 0.5) and a high (f = 0.9) fraction of
missing values. The results show that for a low fraction of missing values (f = 0.5) the
robust classifier is only marginally better than the nominal classifier, but for large f = 0.9
the gain is significant. This confirms that the imputation strategy fails for high noise.

The standard misclassification error for the robust classifier, using the standard prediction
(1), is also shown in the graphs with the legend robust. As expected, the robust classifier's
performance does not deteriorate in the standard misclassification sense as η is increased.

In summary, the results seem to suggest that for low noise levels the nominal classifier
trained on imputed data performs as well as the robust formulation. 
But for high noise
levels the robust formulation yields dividends in the worst case sense.

6 Conclusions

An SOCP formulation was proposed for classifying noisy observations and the resulting
formulation was applied to the missing data case. In the worst case sense the classifier
shows better performance than the standard imputation strategy. Closely related to this
work is the Total Support Vector Classification (TSVC) formulation presented in [1]. The
TSVC formulation tries to reconstruct the original maximal margin classifier in the pres-
ence of noisy data. Both the TSVC formulation and the approach in this paper address the
issue of uncertainty in input data, and it would be an important research direction to
compare the two approaches.

Acknowledgements CB was partly funded by ISRO-IISc Space Technology Cell (Grant
number IST/ECA/CB/152). National ICT Australia is funded through the Australian Gov-
ernment's Backing Australia's Ability initiative, in part through the Australian Research
Council. AS was supported by grants of the ARC. We thank Laurent El Ghaoui, Michael
Jordan, Gunnar Rätsch, and Frederik Schaffalitzky for helpful discussions and comments.

References

[1] J. Bi and T. Zhang. Support vector classification with input data uncertainty. In Ad-
 vances in Neural Information Processing Systems. MIT Press, 2004.

[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.

[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297,
 1995.

[4] G. Fung, O. L. Mangasarian, and J. Shavlik. Knowledge-based support vector ma-
 chine classifiers. In Advances in Neural Information Processing Systems. MIT Press,
 2002.

[5] T. Graepel and R. Herbrich. Invariant pattern recognition by semidefinite pro-
 gramming machines. In Advances in Neural Information Processing Systems 16, Cam-
 bridge, MA, 2003. MIT Press.

[6] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. 
Lebret. Applications of second-order
 cone programming. Linear Algebra and its Applications, 284(1-3):193-228, 1998.

[7] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press,
 1979.

[8] Y. Nesterov and A. Nemirovskii. Interior Point Algorithms in Convex Programming.
 Number 13 in Studies in Applied Mathematics. SIAM, Philadelphia, 1993.

[9] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
", "award": [], "sourceid": 2670, "authors": [{"given_name": "Chiranjib", "family_name": "Bhattacharyya", "institution": null}, {"given_name": "Pannagadatta", "family_name": "Shivaswamy", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}