{"title": "t-logistic regression", "book": "Advances in Neural Information Processing Systems", "page_first": 514, "page_last": 522, "abstract": "We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets.", "full_text": "t-Logistic Regression\n\nNan Ding2, S.V. N. Vishwanathan1,2\n\nDepartments of 1Statistics and 2Computer Science\n\nPurdue University\n\nding10@purdue.edu, vishy@stat.purdue.edu\n\nAbstract\n\nWe extend logistic regression by using t-exponential families which were intro-\nduced recently in statistical physics. This gives rise to a regularized risk mini-\nmization problem with a non-convex loss function. An ef\ufb01cient block coordinate\ndescent optimization scheme can be derived for estimating the parameters. Be-\ncause of the nature of the loss function, our algorithm is tolerant to label noise.\nFurthermore, unlike other algorithms which employ non-convex loss functions,\nour algorithm is fairly robust to the choice of initial values. 
We verify both these observations empirically on a number of synthetic and real datasets.

1 Introduction

Many machine learning algorithms minimize a regularized risk [1]:

J(θ) = Ω(θ) + Remp(θ), where Remp(θ) = (1/m) ∑_{i=1}^m l(xi, yi, θ).   (1)

Here, Ω is a regularizer which penalizes complex θ; and Remp, the empirical risk, is obtained by averaging the loss l over the training dataset {(x1, y1), . . . , (xm, ym)}. In this paper our focus is on binary classification, wherein features of a data point x are extracted via a feature map φ and the label is usually predicted via sign(⟨φ(x), θ⟩). If we define the margin of a training example (x, y) as u(x, y, θ) := y ⟨φ(x), θ⟩, then many popular loss functions for binary classification can be written as functions of the margin. Examples include1

l(u) = 0 if u > 0 and 1 otherwise   (0-1 loss)   (2)
l(u) = max(0, 1 − u)   (Hinge loss)   (3)
l(u) = exp(−u)   (Exponential loss)   (4)
l(u) = log(1 + exp(−u))   (Logistic loss)   (5)

The 0-1 loss is non-convex and difficult to handle; it has been shown that it is NP-hard to even approximately minimize the regularized risk with the 0-1 loss [2]. Therefore, other loss functions can be viewed as convex proxies of the 0-1 loss. Hinge loss leads to support vector machines (SVMs), exponential loss is used in AdaBoost, and logistic regression uses the logistic loss.
Convexity is a very attractive property because it ensures that the regularized risk minimization problem has a unique global optimum [3]. However, as was recently shown by Long and Servedio [4], learning algorithms based on convex loss functions are not robust to noise2.
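In code, the margin losses (2)-(5) are one-liners; the following Python sketch (function names are ours, and u stands for the margin y⟨φ(x), θ⟩):

```python
import math

def zero_one_loss(u):
    # (2): 0 if the margin is positive, 1 otherwise
    return 0.0 if u > 0 else 1.0

def hinge_loss(u):
    # (3): max(0, 1 - u), used by SVMs
    return max(0.0, 1.0 - u)

def exponential_loss(u):
    # (4): exp(-u), used by AdaBoost
    return math.exp(-u)

def logistic_loss(u):
    # (5): log(1 + exp(-u)), used by logistic regression
    return math.log1p(math.exp(-u))
```

Each of the three convex losses upper-bounds the 0-1 loss for every margin u, which is what Figure 1 illustrates.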
Intuitively, a convex loss function grows at least linearly with slope |l′(0)| as u → −∞, which lets examples with u ≪ 0 exert an overwhelming influence. There has been some recent and some not-so-recent work on using non-convex loss functions to alleviate this problem. For instance, a recent manuscript by [5] uses the cdf of the Gaussian distribution to define a non-convex loss.

1We slightly abuse notation and use l(u) to denote l(u(x, y, θ)).
2Although the analysis of [4] is carried out in the context of boosting, we believe the results hold for a larger class of algorithms which minimize a regularized risk with a convex loss function.

In this paper, we continue this line of inquiry and propose a non-convex loss function which is firmly grounded in probability theory. By extending logistic regression from the exponential family to the t-exponential family, a natural generalization of the exponential family of distributions studied in statistical physics [6-10], we obtain the t-logistic regression algorithm. Furthermore, we show that a simple block coordinate descent scheme can be used to solve the resultant regularized risk minimization problem. Analysis of this procedure also intuitively explains why t-logistic regression is able to handle label noise.
Our paper is structured as follows: In section 2 we briefly review logistic regression, especially in the context of exponential families. In section 3, we review t-exponential families, which form the basis for our proposed t-logistic regression algorithm introduced in section 4. In section 5 we utilize ideas from convex multiplicative programming to design an optimization strategy.
Experiments that compare our new approach to existing algorithms on a number of publicly available datasets are reported in section 6, and the paper concludes with a discussion and outlook in section 7. Some technical details as well as extra experimental results can be found in the supplementary material.

Figure 1: Some commonly used loss functions for binary classification (loss vs. margin). The 0-1 loss is non-convex. The hinge, exponential, and logistic losses are convex upper bounds of the 0-1 loss.

2 Logistic Regression

Since we build upon the probabilistic underpinnings of logistic regression, we briefly review some salient concepts. Details can be found in any standard textbook such as [11] or [12]. Assume we are given a labeled dataset (X, Y) = {(x1, y1), . . . , (xm, ym)} with the xi's drawn from some domain X and the labels yi ∈ {±1}. Given a family of conditional distributions parameterized by θ, using Bayes rule, and making a standard iid assumption about the data allows us to write

p(θ | X, Y) = p(θ) ∏_{i=1}^m p(yi | xi; θ) / p(Y | X) ∝ p(θ) ∏_{i=1}^m p(yi | xi; θ),   (6)

where p(Y | X) is clearly independent of θ.
To model p(yi | xi; θ), consider the conditional exponential family of distributions

p(y | x; θ) = exp (⟨φ(x, y), θ⟩ − g(θ | x)),   (7)

with the log-partition function g(θ | x) given by

g(θ | x) = log (exp (⟨φ(x, +1), θ⟩) + exp (⟨φ(x, −1), θ⟩)).   (8)

If we choose the feature map φ(x, y) = (y/2) φ(x), and denote u = y ⟨φ(x), θ⟩, then it is easy to see that p(y | x; θ) is the logistic function

p(y | x; θ) = exp(u/2) / (exp(u/2) + exp(−u/2)) = 1 / (1 + exp(−u)).   (9)

By assuming a zero mean isotropic Gaussian prior N(0, (1/√λ) I) for θ, plugging in (9), and taking logarithms, we can rewrite (6) as

− log p(θ | X, Y) = (λ/2) ‖θ‖² + ∑_{i=1}^m log (1 + exp (−yi ⟨φ(xi), θ⟩)) + const.   (10)

Logistic regression computes a maximum a-posteriori (MAP) estimate for θ by minimizing (10) as a function of θ. Comparing (1) and (10) it is easy to see that the regularizer employed in logistic regression is (λ/2) ‖θ‖², while the loss function is the negative log-likelihood − log p(y | x; θ), which thanks to (9) can be identified with the logistic loss (5).

3 t-Exponential Family of Distributions

In this section we will look at generalizations of the log and exp functions which were first introduced in statistical physics [6-9]. Some extensions and machine learning applications were presented in [13].
In fact, a more general class of functions was studied in these publications, but for our purposes we will restrict our attention to the so-called t-exponential and t-logarithm functions. The t-exponential function expt for 0 < t < 2 is defined as follows:

expt(x) := exp(x) if t = 1, and [1 + (1 − t)x]_+^{1/(1−t)} otherwise,   (11)

where (·)_+ = max(·, 0). Some examples are shown in Figure 2. Clearly, expt generalizes the usual exp function, which is recovered in the limit as t → 1. Furthermore, many familiar properties of exp are preserved: expt functions are convex, non-decreasing, non-negative and satisfy expt(0) = 1 [9]. But expt does not preserve one very important property of exp, namely expt(a + b) ≠ expt(a) · expt(b). One can also define the inverse of expt, namely logt, as

logt(x) := log(x) if t = 1, and (x^{1−t} − 1)/(1 − t) otherwise.   (12)

Similarly, logt(ab) ≠ logt(a) + logt(b). From Figure 2, it is clear that expt decays towards 0 more slowly than the exp function for 1 < t < 2. This important property leads to a family of heavy-tailed distributions which we will later exploit.

Figure 2: Left: expt and Middle: logt for various values of t indicated. The right figure depicts the t-logistic loss functions for different values of t.
When t = 1, we recover the logistic loss.

Analogous to the exponential family of distributions, the t-exponential family of distributions is defined as [9, 13]:

p(x; θ) := expt (⟨φ(x), θ⟩ − gt(θ)).   (13)

A prominent member of the t-exponential family is the Student's-t distribution [14]. Just like in the exponential family case, the log-partition function gt ensures that p(x; θ) is normalized. However, in general no closed form solution exists for computing gt exactly. A closely related distribution, which often appears when working with t-exponential families, is the so-called escort distribution [9, 13]:

qt(x; θ) := p(x; θ)^t / Z(θ),   (14)

where Z(θ) = ∫ p(x; θ)^t dx is the normalizing constant which ensures that the escort distribution integrates to 1.
Although gt(θ) is not the cumulant function of the t-exponential family, it still preserves convexity. In addition, it is very close to being a moment generating function:

∇θ gt(θ) = E_{qt(x;θ)} [φ(x)].   (15)

The proof is provided in the supplementary material. A general version of this result appears as Lemma 3.8 in Sears [13] and a version specialized to the generalized exponential families appears as Proposition 5.2 in [9].
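Definitions (11) and (12) transcribe directly into code. A minimal Python sketch (the behaviour at the truncation boundary for t > 1, where the negative exponent sends expt to infinity, is our reading of (·)_+):

```python
import math

def exp_t(x, t):
    """t-exponential (11): exp(x) at t = 1, else [1 + (1-t)x]_+^(1/(1-t))."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        # truncation (.)_+: 0 for t < 1; for t > 1 the negative
        # exponent 1/(1-t) sends the function to +infinity
        return 0.0 if t < 1.0 else float("inf")
    return base ** (1.0 / (1.0 - t))

def log_t(x, t):
    """t-logarithm (12), the inverse of exp_t on its range (x > 0)."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)
```

For 1 < t < 2 the slow decay is easy to check numerically: exp_t(-5, 1.5) = 3.5^(-2) ≈ 0.082, far larger than exp(-5) ≈ 0.0067, which is the heavy-tail property exploited below.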
The main difference from ∇θ g(θ) of the normal exponential family is that ∇θ gt(θ) is now equal to the expectation of φ(x) under the escort distribution qt(x; θ) instead of p(x; θ).

4 Binary Classification with the t-Exponential Family

In t-logistic regression we model p(y | x; θ) via a conditional t-exponential family distribution

p(y | x; θ) = expt (⟨φ(x, y), θ⟩ − gt(θ | x)),   (16)

where 1 < t < 2, and compute the log-partition function gt by noting that

expt (⟨φ(x, +1), θ⟩ − gt(θ | x)) + expt (⟨φ(x, −1), θ⟩ − gt(θ | x)) = 1.   (17)

Even though no closed form solution exists, one can compute gt given θ and x efficiently using numerical techniques.
The Student's-t distribution can be regarded as a counterpart of the isotropic Gaussian prior in the t-exponential family [14]. Recall that a one dimensional Student's-t distribution is given by

St(x | µ, σ, v) = Γ((v + 1)/2) / (√(vπ) Γ(v/2) σ^{1/2}) · [1 + (x − µ)²/(vσ)]^{−(v+1)/2},   (18)

where Γ(·) denotes the usual Gamma function and v > 1 so that the mean is finite.
If we select t satisfying −(v + 1)/2 = 1/(1 − t) and denote

Ψ = [Γ((v + 1)/2) / (√(vπ) Γ(v/2) σ^{1/2})]^{−2/(v+1)},   (19)

then by some simple but tedious calculation (included in the supplementary material)

St(x | µ, σ, v) = expt(−λ̃ (x − µ)²/2 − g̃t), where λ̃ = 2Ψ / ((t − 1)vσ) and g̃t = (Ψ − 1)/(t − 1).   (20)

Therefore, we work with the Student's-t prior in our setting:

p(θ) = ∏_{j=1}^d p(θj) = ∏_{j=1}^d St(θj | 0, 2/λ, (3 − t)/(t − 1)).

Here, the degrees of freedom of the Student's-t distribution are chosen such that it also belongs to the expt family, which in turn yields v = (3 − t)/(t − 1). The Student's-t prior is usually preferred to the Gaussian prior when the underlying distribution is heavy-tailed. In practice, it is known to be a robust3 alternative to the Gaussian distribution [16, 17].
As before, if we let φ(x, y) = (y/2) φ(x) and plot the negative log-likelihood − log p(y | x; θ), then we no longer obtain a convex loss function (see Figure 2). Similarly, − log p(θ) is no longer convex when we use the Student's-t prior.
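As noted above, (17) admits no closed form, but its left-hand side is strictly decreasing in gt, so a simple bisection recovers gt(θ | x) numerically. A sketch under our own naming (a and b stand for ⟨φ(x, +1), θ⟩ and ⟨φ(x, −1), θ⟩; bisection is just one workable choice, and the exp_t helper here is a direct transcription of (11) for 1 < t < 2):

```python
def exp_t(x, t):
    # deformed exponential (11), restricted to 1 < t < 2
    base = 1.0 + (1.0 - t) * x
    return base ** (1.0 / (1.0 - t)) if base > 0.0 else float("inf")

def log_partition_t(a, b, t, tol=1e-12):
    """Solve exp_t(a - g) + exp_t(b - g) = 1 for g, i.e. eq. (17).

    The left-hand side decreases monotonically in g, so we bracket
    the root and bisect."""
    lo = max(a, b)                       # here the sum is >= exp_t(0) = 1
    hi = lo + 1.0
    while exp_t(a - hi, t) + exp_t(b - hi, t) > 1.0:
        hi += 1.0                        # grow the bracket until the sum < 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if exp_t(a - mid, t) + exp_t(b - mid, t) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With gt in hand, p(+1 | x; θ) = exp_t(a − gt) and p(−1 | x; θ) = exp_t(b − gt) sum to 1 by construction.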
This makes optimizing the regularized risk challenging, and we therefore employ a different strategy.
Since logt is also a monotonically increasing function, instead of working with log we can equivalently work with the logt function (12) and minimize the following objective function:

Ĵ(θ) = − logt [ p(θ) ∏_{i=1}^m p(yi | xi; θ) / p(Y | X) ]
     = (1/(t − 1)) [ p(θ) ∏_{i=1}^m p(yi | xi; θ) / p(Y | X) ]^{1−t} + 1/(1 − t),   (21)

where p(Y | X) is independent of θ. Using (13), (18), and (11), we can further write

Ĵ(θ) ∝ ∏_{j=1}^d [1 + (1 − t)(−λ̃ θj²/2 − g̃t)] · ∏_{i=1}^m [1 + (1 − t)(⟨(yi/2) φ(xi), θ⟩ − gt(θ | xi))] + const.
     = ∏_{j=1}^d rj(θ) · ∏_{i=1}^m li(θ) + const.   (22)

3There is no unique definition of robustness. For example, one of the definitions is through outlier-proneness [15]: p(θ | X, Y, xn+1, yn+1) → p(θ | X, Y) as xn+1 → ∞.

Since t > 1, it is easy to see that rj(θ) > 0 is a convex function of θ. On the other hand, since gt is convex and t > 1, it follows that li(θ) > 0 is also a convex function of θ. In summary, Ĵ(θ) is a product of positive convex functions. In the next section we will present an efficient optimization strategy for dealing with such problems.

5 Convex Multiplicative Programming

In convex multiplicative programming [18] we are interested in the following optimization problem:

min_θ P(θ) := ∏_{n=1}^N zn(θ) s.t. θ ∈ R^d,   (23)

where the zn(θ) are positive convex functions.
Clearly, (22) can be identified with (23) by setting N = d + m and identifying zn(θ) = rn(θ) for n = 1, . . . , d and zn+d(θ) = ln(θ) for n = 1, . . . , m. The optimal solutions to problem (23) can be obtained by solving the following parametric problem (see Theorem 2.1 of Kuno et al. [18]):

min_θ min_ξ MP(θ, ξ) := ∑_{n=1}^N ξn zn(θ) s.t. θ ∈ R^d, ξ > 0, ∏_{n=1}^N ξn ≥ 1.   (24)

The optimization problem in (24) is very reminiscent of logistic regression. In logistic regression, ln(θ) = −⟨(yn/2) φ(xn), θ⟩ + g(θ | xn), while here ln(θ) = 1 + (1 − t)(⟨(yn/2) φ(xn), θ⟩ − gt(θ | xn)). The key difference is that in t-logistic regression each data point xn has a weight (or influence) ξn associated with it.
Exact algorithms have been proposed for solving (24) (for instance, [18]). However, the computational cost of these algorithms grows exponentially with respect to N, which makes them impractical for our purposes. Instead, we apply a block coordinate descent based method. The main idea is to minimize (24) with respect to θ and ξ separately.
ξ-Step: Assume that θ is fixed, and denote z̃n = zn(θ) to rewrite (24) as:

min_ξ ∑_{n=1}^N ξn z̃n s.t. ξ > 0, ∏_{n=1}^N ξn ≥ 1.   (25)

Since the objective function is linear in ξ and the feasible region is a convex set, (25) is a convex optimization problem.
By introducing a non-negative Lagrange multiplier γ ≥ 0, the partial Lagrangian and its gradient with respect to ξ_{n′} can be written as

L(ξ, γ) = ∑_{n=1}^N ξn z̃n + γ (1 − ∏_{n=1}^N ξn),   (26)

∂L(ξ, γ)/∂ξ_{n′} = z̃_{n′} − γ ∏_{n ≠ n′} ξn.   (27)

Setting the gradient to 0 yields γ = z̃_{n′} / ∏_{n ≠ n′} ξn. Since z̃_{n′} > 0, it follows that γ cannot be 0. By the K.K.T. conditions [3], we can conclude that ∏_{n=1}^N ξn = 1. This in turn implies that γ = z̃_{n′} ξ_{n′}, or

(ξ1, . . . , ξN) = (γ/z̃1, . . . , γ/z̃N), with γ = (∏_{n=1}^N z̃n)^{1/N}.   (28)

Recall that ξn in (24) is the weight (or influence) of each term zn(θ). The above analysis shows that γ = z̃n ξn remains constant for all n. If z̃n becomes very large then its influence ξn is reduced. Therefore, points with very large loss have their influence capped, and this makes the algorithm robust to outliers.
θ-Step: In this step we fix ξ > 0 and solve for the optimal θ. This step is essentially the same as logistic regression, except that each component now carries a weight ξn:

min_θ ∑_{n=1}^N ξn zn(θ) s.t. θ ∈ R^d.   (29)

This is a standard unconstrained convex optimization problem which can be solved by any off-the-shelf solver. In our case we use the L-BFGS quasi-Newton method. This requires us to compute the gradient ∇θ zn(θ):

for n = 1, . . . , d:  ∇θ zn(θ) = ∇θ rn(θ) = (t − 1) λ̃ θn · en,
for n = 1, . . . , m:  ∇θ zn+d(θ) = ∇θ ln(θ) = (1 − t) [ (yn/2) φ(xn) − ∇θ gt(θ | xn) ]
                       = (1 − t) [ (yn/2) φ(xn) − E_{qt(yn | xn; θ)} [(yn/2) φ(xn)] ],   (30)

where en denotes the d dimensional vector with one at the n-th coordinate and zeros elsewhere (the n-th unit vector), and qt(y | x; θ) is the escort distribution of p(y | x; θ) (16):

qt(y | x; θ) = p(y | x; θ)^t / (p(+1 | x; θ)^t + p(−1 | x; θ)^t).

The objective function is monotonically decreasing and is guaranteed to converge to a stable point of P(θ). We include the proof in the supplementary material.

6 Experimental Evaluation

Our experimental evaluation is designed to answer four natural questions: 1) How does the generalization capability (measured in terms of test error) of t-logistic regression compare with existing algorithms such as logistic regression and support vector machines (SVMs), both in the presence and absence of label noise? 2) Do the ξ variables we introduced in the previous section have a natural interpretation? 3) How much overhead does t-logistic regression incur as compared to logistic regression? 4) How sensitive is the algorithm to initialization? The last question is particularly important given that the algorithm is minimizing a non-convex loss.
To answer the above questions empirically we use six datasets, two of which are synthetic. The Long-Servedio dataset is an artificially constructed dataset which shows that algorithms minimizing a differentiable convex loss are not tolerant to label noise [4].
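Stepping back to Section 5 for a moment: the closed-form ξ-step (28) is short enough to state in code. Each weight is the geometric mean γ of the current values z̃1, . . . , z̃N divided by z̃n itself, so z̃n ξn is constant and points with very large loss have their influence capped (a minimal sketch; the function name is ours):

```python
import math

def xi_step(zs):
    """Closed-form xi-step (28): xi_n = gamma / z_n, where gamma is the
    geometric mean of the z_n, so z_n * xi_n is constant across n."""
    # geometric mean computed via logs for numerical stability
    gamma = math.exp(sum(math.log(z) for z in zs) / len(zs))
    return [gamma / z for z in zs]
```

For example, a point whose z̃n is ten times larger than its neighbours' receives a ten-times-smaller weight, and the weights always multiply to 1, satisfying the constraint in (24).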
The examples have 21 dimensions and play one of three possible roles: large margin examples (25%: x_{1,2,...,21} = y); pullers (25%: x_{1,...,11} = y, x_{12,...,21} = −y); and penalizers (50%: randomly select and set 5 of the first 11 coordinates and 6 of the last 10 coordinates to y, and set the remaining coordinates to −y). The Mease-Wyner dataset is another synthetic dataset to test the effect of label noise. The input x is a 20-dimensional vector where each coordinate is uniformly distributed on [0, 1]. The label y is +1 if ∑_{j=1}^5 xj ≥ 2.5 and −1 otherwise [19]. In addition, we also test on the Mushroom, USPS-N (9 vs. others), Adult, and Web datasets, which are often used to evaluate machine learning algorithms (see Table 1 in the supplementary material for details).
For simplicity, we use the identity feature map φ(x) = x in all our experiments, and set t ∈ {1.3, 1.6, 1.9} for t-logistic regression. Our comparators are logistic regression, linear SVMs4, and an algorithm (the probit) which employs the probit loss, L(u) = 1 − erf(2u), used in BrownBoost/RobustBoost [5]. We use the L-BFGS algorithm [21] for the θ-step in t-logistic regression. L-BFGS is also used to train logistic regression and the probit loss based algorithms. Label noise is added by randomly choosing 10% of the labels in the training set and flipping them; each dataset is tested with and without label noise. We randomly select and hold out 30% of each dataset as a validation set and use the remaining 70% for 10-fold cross validation. The optimal parameters, namely λ for t-logistic and logistic regression and C for SVMs, are chosen by performing a grid search over the parameter space {2^{−7}, 2^{−6}, . . . , 2^{7}} and observing the prediction accuracy over the validation set.
The\n\nconvergence criterion is to stop when the change in the objective function value is less than 10\u22124.\nAll code is written in Matlab, and for the linear SVM we use the Matlab interface of LibSVM [22].\nExperiments were performed on a Qual-core machine with Dual 2.5 Ghz processor and 32 Gb RAM.\n\nIn Figure 3, we plot the test error with and without label noise. In the latter case, the test error of\nt-logistic regression is very similar to logistic regression and Linear SVM (with 0% test error in\n\n4We also experimented with RampSVM [20], however, the results are worser than the other algorithms. We\n\ntherefore report these results in the supplementary material.\n\n6\n\n\f)\n\n%\n\n(\nr\no\nr\nr\nE\nt\ns\ne\nT\n\n)\n\n%\n\n(\nr\no\nr\nr\nE\nt\ns\ne\nT\n\n32\n\n24\n\n16\n\n8\n\n0\n\n6.0\n\n4.5\n\n3.0\n\n1.5\n\n0.0\n\n6.0\n\n4.5\n\n3.0\n\n1.5\n\n0.0\n\n16.8\n\n16.0\n\n15.2\n\n14.4\n\nlogis.\n\nt=1.3 t=1.6 t=1.9 probit SVM\n\nlogis.\n\nt=1.3 t=1.6 t=1.9 probit SVM\n\n1.2\n\n0.9\n\n0.6\n\n0.3\n\n0.0\n\n3.2\n\n2.4\n\n1.6\n\n0.8\n\n0.0\n\nlogis.\n\nt=1.3 t=1.6 t=1.9 probit SVM\n\nFigure 3: The test error rate of various algorithms on six datasets (left to right, top: Long-Servedio,\nMease-Wyner, Mushroom; bottom: USPS-N, Adult, Web) with and without 10% label noise. All\nalgorithms are initialized with \u03b8 = 0. The blue (light) bar denotes a clean dataset while the magenta\n(dark) bar are the results with label noise added. 
Also see Table 3 in the supplementary material.\n\nLong-Servedio and Mushroom datasets), with a slight edge on some datasets such as Mease-Wyner.\nWhen label noise is added, t-logistic regression (especially with t = 1.9) shows signi\ufb01cantly5 better\nperformance than all the other algorithms on all datasets except the USPS-N, where it is marginally\noutperformed by the probit.\nTo obtain Figure 4 we used the noisy version of the datasets, chose one of the 10 folds used in the\nprevious experiment, and plotted the distribution of the 1/z \u221d \u03be obtained after training with t = 1.9.\nTo distinguish the points with noisy labels we plot them in cyan while the other points are plotted in\nred. Analogous plots for other values of t can be found in the supplementary material. Recall that \u03be\ndenotes the in\ufb02uence of a point. One can clearly observe that the \u03be of the noisy data is much smaller\nthan that of the clean data, which indicates that the algorithm is able to effectively identify these\npoints and cap their in\ufb02uence. In particular, on the Long-Servedio dataset observe the 4 distinct\nspikes. From left to right, the \ufb01rst spike corresponds to the noisy large margin examples, the second\nspike represents the noisy pullers, the third spike denotes the clean pullers, while the rightmost spike\ncorresponds to the clean large margin examples. Clearly, the noisy large margin examples and the\nnoisy pullers are assigned a low value of \u03be thus capping their in\ufb02uence and leading to the perfect\nclassi\ufb01cation of the test set. On the other hand, logistic regression is unable to discriminate between\nclean and noisy training samples which leads to bad performance on noisy datasets.\nDetailed timing experiments can be found in Table 4 in the supplementary material. In a nutshell,\nt-logistic regression takes longer to train than either logistic regression or the probit. The reasons\nare not dif\ufb01cult to see. 
First, there is no closed form expression for gt(θ | x). We therefore resort to pre-computing it at some fixed locations and using a spline method to interpolate values at other locations. Second, since the objective function is not convex, several iterations of the ξ and θ steps might be needed. Surprisingly, the L-BFGS algorithm, which is not designed to optimize non-convex functions, is able to minimize (22) directly in many cases. When it does converge, it is often faster than the convex multiplicative programming algorithm. However, in some cases (as expected) it fails to find a direction of descent and exits. A common remedy for this is bundle L-BFGS with a trust-region approach [21].
Given that the t-logistic objective function is non-convex, one naturally worries about how different initial values affect the quality of the final solution. To answer this question, we initialized the algorithm with 50 different randomly chosen θ ∈ [−0.5, 0.5]^d, and report test performances of the various solutions obtained in Figure 5.
Just like logistic regression, which uses a convex loss and hence converges to the same solution independent of the initialization, the solution obtained

5We provide the significance test results in Table 2 of the supplementary material.

Figure 4: The distribution of ξ obtained after training t-logistic regression with t = 1.9 on datasets with 10% label noise. Left to right, top: Long-Servedio, Mease-Wyner, Mushroom; bottom: USPS-N, Adult, Web. The red (dark) bars (resp. cyan (light) bars) indicate the frequency of ξ assigned to points without (resp. with) label noise.

by t-logistic regression seems fairly independent of the initial value of θ. On the other hand, the performance of the probit fluctuates widely with different initial values of θ.

Figure 5: The test error rate for different initializations. Left to right, top: Long-Servedio, Mease-Wyner, Mushroom; bottom: USPS-N, Adult, Web.

7 Discussion and Outlook

In this paper, we generalize logistic regression to t-logistic regression by using the t-exponential family.
The new algorithm has a probabilistic interpretation and is more robust to label noise. Even though the resulting objective function is non-convex, empirically it appears to be insensitive to initialization. There are a number of avenues for future work. In the Long-Servedio experiment, if the label noise is increased significantly beyond 10%, the performance of t-logistic regression may degrade (see Fig. 6 in the supplementary material). Understanding and explaining this issue theoretically and empirically remains an open problem. It will be interesting to investigate whether t-logistic regression can be married with graphical models to yield t-conditional random fields. We will also focus on better numerical techniques to accelerate the θ-step, especially a faster way to compute gt.

References
[1] Choon Hui Teo, S. V. N. Vishwanathan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. J. Mach. Learn. Res., 11:311-365, January 2010.
[2] S. Ben-David, N. Eiron, and P. M. Long. On the difficulty of approximately maximizing agreements. J. Comput. System Sci., 66(3):496-514, 2003.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, England, 2004.
[4] Phil Long and Rocco Servedio. Random classification noise defeats all convex potential boosters. Machine Learning Journal, 78(3):287-304, 2010.
[5] Yoav Freund. A more robust boosting algorithm. Technical Report arXiv:0905.2138, arXiv, May 2009.
[6] J. Naudts. Deformed exponentials and logarithms in generalized thermostatistics. Physica A, 316:323-334, 2002. URL http://arxiv.org/pdf/cond-mat/0203489.
[7] J. Naudts. Generalized thermostatistics based on deformed exponential and logarithmic functions. Physica A, 340:32-40, 2004.
[8] J. Naudts. Generalized thermostatistics and mean-field theory. Physica A, 332:279-300, 2004.
[9] J. Naudts.
Estimators, escort probabilities, and φ-exponential families in statistical physics. Journal of Inequalities in Pure and Applied Mathematics, 5(4), 2004.
[10] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys., 52, 1988.
[11] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[12] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, New York, 2nd edition, 2009.
[13] Timothy D. Sears. Generalized Maximum Entropy, Convexity, and Machine Learning. PhD thesis, Australian National University, 2008.
[14] Andre Sousa and Constantino Tsallis. Student's t- and r-distributions: Unified derivation from an entropic variational principle. Physica A, 236:52-57, 1994.
[15] A. O'Hagan. On outlier rejection phenomena in Bayes inference. Journal of the Royal Statistical Society, Series B, 41(3):358-367, 1979.
[16] Kenneth L. Lange, Roderick J. A. Little, and Jeremy M. G. Taylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408):881-896, 1989.
[17] J. Vanhatalo, P. Jylanki, and A. Vehtari. Gaussian process regression with Student-t likelihood. In Neural Information Processing Systems, 2009.
[18] Takahito Kuno, Yasutoshi Yajima, and Hiroshi Konno. An outer approximation method for minimizing the product of several convex functions on a convex set. Journal of Global Optimization, 3(3):325-335, September 1993.
[19] David Mease and Abraham Wyner. Evidence contrary to the statistical view of boosting. J. Mach. Learn. Res., 9:131-156, February 2008.
[20] R. Collobert, F. H. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In W. W. Cohen and A. Moore, editors, Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), pages 201-208. ACM, 2006.
[21] J. Nocedal and S. J. Wright.
Numerical Optimization. Springer Series in Operations Research. Springer, 1999.
[22] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[23] Fabian Sinz. UniverSVM: Support Vector Machine with Large Scale CCCP Functionality, 2006. Software available at http://www.kyb.mpg.de/bs/people/fabee/universvm.html.
", "award": [], "sourceid": 177, "authors": [{"given_name": "Nan", "family_name": "Ding", "institution": null}, {"given_name": "S.v.n.", "family_name": "Vishwanathan", "institution": null}]}