{"title": "An interior-point stochastic approximation method and an L1-regularized delta rule", "book": "Advances in Neural Information Processing Systems", "page_first": 233, "page_last": 240, "abstract": "The stochastic approximation method is behind the solution to many important, actively-studied problems in machine learning. Despite its far-reaching application, there is almost no work on applying stochastic approximation to learning problems with constraints. The reason for this, we hypothesize, is that no robust, widely-applicable stochastic approximation method exists for handling such problems. We propose that interior-point methods are a natural solution. We establish the stability of a stochastic interior-point approximation method both analytically and empirically, and demonstrate its utility by deriving an on-line learning algorithm that also performs feature selection via L1 regularization.", "full_text": "An interior-point stochastic approximation\n\nmethod and an L1-regularized delta rule\n\nPeter Carbonetto\npcarbo@cs.ubc.ca\n\nMark Schmidt\n\nschmidtm@cs.ubc.ca\n\nNando de Freitas\nnando@cs.ubc.ca\n\nDepartment of Computer Science\n\nUniversity of British Columbia\n\nVancouver, B.C., Canada V6T 1Z4\n\nAbstract\n\nThe stochastic approximation method is behind the solution to many im-\nportant, actively-studied problems in machine learning. Despite its far-\nreaching application, there is almost no work on applying stochastic ap-\nproximation to learning problems with general constraints. The reason for\nthis, we hypothesize, is that no robust, widely-applicable stochastic ap-\nproximation method exists for handling such problems. We propose that\ninterior-point methods are a natural solution. 
We establish the stability\nof a stochastic interior-point approximation method both analytically and\nempirically, and demonstrate its utility by deriving an on-line learning algorithm\nthat also performs feature selection via L1 regularization.\n\n1 Introduction\nThe stochastic approximation method supplies the theoretical underpinnings behind many\nwell-studied algorithms in machine learning, notably policy gradient and temporal differences\nfor reinforcement learning, inference for tracking and filtering, on-line learning\n[1, 17, 19], regret minimization in repeated games, and parameter estimation in probabilistic\ngraphical models, including expectation maximization (EM) and the contrastive\ndivergences algorithm. The main idea behind stochastic approximation is simple yet profound.\nIt is simple because it is only a slight modification to the most basic optimization\nmethod, gradient descent. It is profound because it suggests a fundamentally different way\nof optimizing a problem\u2014instead of insisting on making progress toward the solution at\nevery iteration, it only requires that progress be achieved on average.\n\nDespite its successes, people tend to steer clear of constraints on the parameters. While\nthere is a sizable body of work on treating constraints by extending established optimization\ntechniques to the stochastic setting, such as projection [14], subgradient (e.g. [19, 27]) and\npenalty methods [11, 24], existing methods are either unreliable or suited only to specific\ntypes of constraints. We argue that a reliable stochastic approximation method that handles\nconstraints is needed because constraints routinely arise in the mathematical formulation of\nlearning problems, and the alternative approach\u2014penalization\u2014is often unsatisfactory.\n\nOur main contribution is a new stochastic approximation method in which each step is the\nsolution to the primal-dual system arising in interior-point methods [7]. 
Our method is easy\nto implement, dominates other approaches, and provides a general solution to constrained\nlearning problems. Moreover, we show interior-point methods are remarkably well-suited to\nstochastic approximation, a result that is far from trivial when one considers that stochastic\nalgorithms do not behave like their deterministic counterparts (e.g. Wolfe conditions [13]\ndo not apply). We derive a variant of Widrow and Hoff\u2019s classic \u201cdelta rule\u201d for on-line\nlearning (Sec. 5). It achieves feature selection via L1 regularization (known to statisticians\nas the Lasso [22] and to signal processing engineers as basis pursuit [3]), so it is well-suited\nto learning problems with lots of data in high dimensions, such as the problem of filtering\nspam from your email account (Sec. 5.2). To our knowledge, no method has been proposed\nthat reliably achieves L1 regularization in large-scale problems when data is processed\non-line or on-demand. Finally, it is important that we establish convergence guarantees for our\nmethod (Sec. 4). To do so, we rely on math from stochastic approximation and optimization.\n\n2 Overview of algorithm\n\nIn their 1951 research paper, Robbins and Monro [15] examined the problem of tuning a\ncontrol variable x (e.g. amount of alkaline solution) so that the expected outcome of the\nexperiment F(x) (pH of soil) attains a desired level \u03b1 (so your Hydrangea have pink blossoms).\nWhen the distribution of the experimental outcomes is unknown to the statistician\nor gardener, it may still be possible to take observations at x. In such a case, Robbins and\nMonro showed that a particularly effective way to achieve a response level \u03b1 = 0 is to take\na (hopefully unbiased) measurement yk \u2248 F(xk), adjust the control variable according to\n\nx_{k+1} = x_k \u2212 a_k y_k    (1)\n\nfor step size ak > 0, then repeat. Provided the sequence {ak} behaves like the harmonic\nseries (see Sec. 4.1), this algorithm converges to the solution F(x\u22c6) = 0.\n\nSince the original publication, mathematicians have extended, generalized, and further weakened\nthe convergence conditions; see [11] for some of these developments. Kiefer and Wolfowitz\nre-interpreted the stochastic process as one of optimizing an unconstrained objective\n(F(x) acts as the gradient vector) and later Dvoretzky pointed out that each measurement\ny is actually the gradient F(x) plus some noise \u03be(x). Hence, the stochastic gradient algorithm.\nIn this paper, we introduce a convergent sequence of nonlinear systems F\u00b5(x) = 0 and\ninterpret the Robbins-Monro process {xk} as solving a constrained optimization problem.\n\nWe focus on convex optimization problems [2] of the form\n\nminimize f(x)\nsubject to c(x) \u2264 0,    (2)\n\nwhere c(x) is a vector of inequality constraints, f(x) and c(x) have continuous partial\nderivatives, and measurements yk of the gradient at xk are noisy. The feasible set, by contrast,\nshould be known exactly. To simplify our exposition, we do not consider equality constraints;\ntechniques for handling them are discussed in [13].\nConvexity is a standard assumption made to simplify analysis of stochastic approximation\nalgorithms and, besides, constrained, non-convex optimization raises unresolved complications.\nWe assume standard constraint qualifications so we can legitimately identify optimal\nsolutions via the Karush-Kuhn-Tucker (KKT) conditions [2, 13].\n\nprocedure IP\u2013SG (Interior-point stochastic gradient)\nfor k = 1, 2, 3, . . .\n  \u2022 Set max. step size \u00e2k and centering parameter \u03c3k.\n  \u2022 Set barrier parameter \u00b5k = \u2212\u03c3k z_k^T c(xk)/m.\n  \u2022 Run simulation to obtain gradient observation yk.\n  \u2022 Compute primal-dual search direction (\u2206xk, \u2206zk) by solving equations (6,7) with \u2207f(x) = yk.\n  \u2022 Run backtracking line search to find largest ak \u2264 min{\u00e2k, 0.995 min_i(\u2212zk,i/\u2206zk,i)} such that c(xk\u22121 + ak\u2206xk) < 0, where min_i( \u00b7 ) is over all i such that \u2206zk,i < 0.\n  \u2022 Set xk = xk\u22121 + ak\u2206xk and zk = zk\u22121 + ak\u2206zk.\n\nFigure 1: Proposed stochastic gradient algorithm.\n\nFollowing the standard barrier approach [7], we frame the constrained optimization problem\nas a sequence of unconstrained objectives. This in turn is cast as a sequence of root-finding\nproblems F\u00b5(x) = 0, where \u00b5 > 0 controls for the accuracy of the approximate objective\nand should tend toward zero. As we explain, a dramatically more effective strategy is to\nsolve for the root of the primal-dual equations F\u00b5(x, z) = 0, where z represents the set of dual\nvariables. This is the basic formula of the interior-point stochastic approximation method.\n\nFig. 1 outlines our main contribution. Provided x0 is feasible and z0 > 0, every subsequent\niterate (xk, zk) will be a feasible or \u201cinterior\u201d point as well. Notice the absence of a sufficient\ndecrease condition on \u2016F\u00b5(x, z)\u2016 or suitable merit function; this is not needed in the\nstochastic setting. Our stochastic approximation algorithm requires a slightly non-standard\ntreatment because the target F\u00b5(x, z) moves as \u00b5 changes. Fortunately, convergence under\nnon-stationarity has been studied in the literature on tracking and adaptive filtering. The\nnext section is devoted to deriving the primal-dual search direction (\u2206x, \u2206z).\n\n3 Background on interior-point methods\nWe motivate and derive primal-dual interior-point methods starting from the logarithmic\nbarrier method. 
Barrier methods date back to the work of Fiacco and McCormick [6] in\nthe 1960s, but they lost favour due to their unreliable nature. Ill-conditioning was long\nconsidered their undoing. However, careful analysis [7] has shown that poor conditioning is\nnot the problem\u2014rather, it is a deficiency in the search direction. In the next section, we\nexploit this very analysis to show that every iteration of our algorithm produces a stable\niterate in the face of: 1) ill-conditioned linear systems, 2) noisy observations of the gradient.\n\nThe logarithmic barrier approach for the constrained optimization problem (2) amounts to\nsolving a sequence of unconstrained subproblems of the form\n\nminimize f_\u00b5(x) \u2261 f(x) \u2212 \u00b5 \u2211_{i=1}^{m} log(\u2212c_i(x)),    (3)\n\nwhere \u00b5 > 0 is the barrier parameter, and m is the number of inequality constraints.\nAs \u00b5 becomes smaller, the barrier function f\u00b5(x) acts more and more like the objective.\nThe philosophy of barrier methods differs fundamentally from \u201cexterior\u201d penalty methods\nthat penalize points violating the constraints [13, Chapter 17] because the logarithm in (3)\nprevents iterates from violating the constraints at all, hence the word \u201cbarrier\u201d.\n\nThe central thrust of the barrier method is to progressively push \u00b5 to zero at a rate which\nallows the iterates to converge to the constrained optimum x\u22c6. Writing out a first-order\nTaylor-series expansion to the optimality conditions \u2207f\u00b5(x) = 0 about a point x, the Newton\nstep \u2206x is the solution to the linear equations \u2207^2 f\u00b5(x) \u2206x = \u2212\u2207f\u00b5(x). The barrier Hessian\nhas long been known to be incredibly ill-conditioned\u2014this fact becomes apparent by writing\nout \u2207^2 f\u00b5(x) in full\u2014but an analysis by Wright [25] shows that the ill-conditioning is not\nharmful under the right conditions. The \u201cright conditions\u201d are that x be within a small\ndistance^1 from the central path or barrier trajectory, which is defined to be the sequence of\nisolated minimizers x\u22c6_\u00b5 satisfying \u2207f\u00b5(x\u22c6_\u00b5) = 0 and c(x\u22c6_\u00b5) < 0. The bad news: the barrier\nmethod is ineffectual at remaining on the barrier trajectory\u2014it pushes iterates too close to\nthe boundary where they are no longer well-behaved [7]. Ordinarily, a convergence test is\nconducted for each value of \u00b5, but this is not a plausible option for the stochastic setting.\n\nPrimal-dual methods form a Newton search direction for both the primal variables and the\nLagrange multipliers. Like classical barrier methods, they fail catastrophically outside the\ncentral path. But their virtue is that they happen to be extremely good at remaining on\nthe central path (even in the stochastic setting; see Sec. 4.2). Primal-dual methods are also\nblessed with strong results regarding superlinear and quadratic rates of convergence [7].\n\nThe principal innovation is to introduce Lagrange multiplier-like variables zi \u2261 \u2212\u00b5/ci(x).\nBy setting \u2207xf\u00b5(x) to zero, we recover the \u201cperturbed\u201d KKT optimality conditions:\n\nF_\u00b5(x, z) \u2261 [\u2207_x f(x) + J^T Z1; CZ1 + \u00b51] = 0,    (4)\n\nwhere Z and C are matrices with z and c(x) along their diagonals, J \u2261 \u2207xc(x), and semicolons separate block rows. Forming\na first-order Taylor expansion about (x, z), the primal-dual Newton step is the solution to\n\n[W, J^T; ZJ, C] [\u2206x; \u2206z] = \u2212[\u2207_x f(x) + J^T Z1; CZ1 + \u00b51],    (5)\n\nwhere W = H + \u2211_{i=1}^{m} z_i \u2207_x^2 c_i(x) is the Hessian of the Lagrangian (as written in any textbook\non constrained optimization), and H is the Hessian of the objective or an approximation.\nThrough block elimination, the Newton step \u2206x is the solution to the symmetric system\n\n(W \u2212 J^T \u03a3J)\u2206x = \u2212\u2207_x f_\u00b5(x),    (6)\n\nwhere \u03a3 \u2261 C^{\u22121}Z. The dual search direction is then recovered according to\n\n\u2206z = \u2212(z + \u00b5/c(x) + \u03a3J\u2206x).    (7)\n\nBecause (2) is a convex optimization problem, we can derive a sensible update rule for the\nbarrier parameter by guessing the distance between the primal and dual objectives [2]. This\nguess is typically \u00b5 = \u2212\u03c3z^T c(x)/m, where \u03c3 > 0 is a centering parameter. This update is\nsupported by the convergence theory (Sec. 4.1) so long as \u03c3k is pushed to zero.\n\n^1 See Sec. 4.3.1 of [7] for the precise meaning of a \u201csmall distance\u201d. Since x must be close to the\ncentral path but far from the boundary, the favourable neighbourhood shrinks as \u00b5 nears 0.\n\n4 Analysis of convergence\n\nFirst we establish conditions upon which the sequence of iterates generated by the algorithm\nconverges almost surely to the solution (x\u22c6, z\u22c6) as the amount of data or iteration count goes\nto infinity. Then we examine the behaviour of the iterates under finite-precision arithmetic.\n\n4.1 Asymptotic convergence\n\nA convergence proof from first principles is beyond the scope of this paper; we build upon\nthe martingale convergence proof of Spall and Cristion for non-stationary systems [21].\n\nAssumptions: We establish convergence under the following conditions. 
They may be\nweakened by applying results from the stochastic approximation and optimization literature.\n1. Unbiased observations: yk is a discrete-time martingale difference with respect to\nthe true gradient \u2207f(xk); that is, E(yk | xk, history up to time k) = \u2207f(xk).\n2. Step sizes: The maximum step sizes \u00e2k bounding ak (see Fig. 1) must approach\nzero (\u00e2k \u2192 0 as k \u2192 \u221e and \u2211_{k=1}^{\u221e} \u00e2_k^2 < \u221e) but not too quickly (\u2211_{k=1}^{\u221e} \u00e2_k = \u221e).\n3. Bounded iterates: lim sup_k \u2016xk\u2016 < \u221e almost surely.\n4. Bounded gradient estimates: for some \u03c1 and for every k, E(\u2016yk\u2016) < \u03c1.\n5. Convexity: The objective f(x) and constraints c(x) are convex.\n6. Strict feasibility: There must exist an x that is strictly feasible; i.e. c(x) < 0.\n7. Regularity assumptions: There exists a feasible minimizer x\u22c6 to the problem (2)\nsuch that first-order constraint qualification and strict complementarity hold, and\n\u2207xf(x), \u2207xc(x) are Lipschitz-continuous. These conditions allow us to directly apply\nstandard theorems on constrained optimization for convex programming [2, 6, 7, 13].\nProposition: Suppose Assumptions 1\u20137 hold. Then \u03b8\u22c6 \u2261 (x\u22c6, z\u22c6) is an isolated (locally\nunique within a \u03b4-neighbourhood) solution to (2), and the iterates \u03b8k \u2261 (xk, zk) of the\nfeasible interior-point stochastic approximation method (Fig. 1) converge to \u03b8\u22c6 almost surely;\nthat is, \u2016\u03b8k \u2212 \u03b8\u22c6\u2016 \u2192 0 as k \u2192 \u221e with probability 1.\nProof: See Appendix A.\n\n4.2 Considerations regarding the central path\n\nThe object of this section is to establish that computing the stochastic primal-dual search\ndirection is numerically stable. (See Part III of [23] for what we mean by \u201cstable\u201d.) The\nconcern is that noisy gradient measurements will lead to wildly perturbed search directions.\nAs we mentioned in Sec. 
3, interior-point methods are surprisingly stable provided the\niterates remain close to the central path, but the prospect of keeping close to the path\nseems particularly tenuous in the stochastic setting. A key observation is that the central\npath is itself perturbed by the stochastic gradient estimates. Following arguments similar\nto those given in Sec. 5 of [7], we show that the stochastic Newton step (6,7) stays on target.\n\nWe define the noisy central path as \u03b8(\u00b5, \u03b5) = (x, z), where (x, z) is a solution to F\u00b5(x, z) = 0\nwith gradient estimate y \u2261 \u2207f(x) + \u03b5. Suppose we are currently at point \u03b8(\u00b5, \u03b5) = (x, z)\nalong the path, and the goal is to move closer to \u03b8(\u00b5\u22c6, \u03b5\u22c6) = (x\u22c6, z\u22c6) by solving (5) or (6,7).\nOne way to assess the quality of the Newton step is to compare it to the tangent line of the\nnoisy central path at (\u00b5, \u03b5). Taking implicit partial derivatives at (x, z), the tangent line is\n\n\u03b8(\u00b5\u22c6, \u03b5\u22c6) \u2248 \u03b8(\u00b5, \u03b5) + (\u00b5\u22c6 \u2212 \u00b5) \u2202\u03b8(\u00b5,\u03b5)/\u2202\u00b5 + (y\u22c6 \u2212 y) \u2202\u03b8(\u00b5,\u03b5)/\u2202\u03b5,    (8)\n\nsuch that\n\n[H, J^T; ZJ, C] [(\u00b5\u22c6 \u2212 \u00b5) \u2202x/\u2202\u00b5 + (y\u22c6 \u2212 y) \u2202x/\u2202\u03b5; (\u00b5\u22c6 \u2212 \u00b5) \u2202z/\u2202\u00b5 + (y\u22c6 \u2212 y) \u2202z/\u2202\u03b5] = \u2212[y\u22c6 \u2212 y; (\u00b5\u22c6 \u2212 \u00b5)1],    (9)\n\nwith y\u22c6 \u2261 \u2207f(x) + \u03b5\u22c6 (semicolons separate block rows). Since we know that F\u00b5(x, z) = 0, the Newton step (5) at (x, z) with\nperturbation \u00b5\u22c6 and stochastic gradient estimate y\u22c6 is the solution to\n\n[H, J^T; ZJ, C] [\u2206x; \u2206z] = \u2212[y\u22c6 \u2212 y; (\u00b5\u22c6 \u2212 \u00b5)1].    (10)\n\nIn conclusion, if the tangent line (8) is a fairly reasonable approximation to the central path,\nthen the stochastic Newton step (10) will make good progress toward \u03b8(\u00b5\u22c6, \u03b5\u22c6).\n\nHaving established that the stochastic gradient algorithm closely follows the noisy central\npath, the analysis of M. H. Wright [26] directly applies, in which round-off error (\u03f5machine)\nis occasionally replaced by gradient noise (\u03b5). Since stability is of fundamental concern\u2014\nparticularly in computing the values of W \u2212 J^T \u03a3J, the right-hand side of (6), and the\nsolution to \u2206x and \u2206z\u2014we elaborate on the significance of Wright\u2019s results in Appendix B.\n\n5 On-line L1 regularization\nIn this section, we apply our findings to the problem of computing an L1-regularized least\nsquares estimator in an \u201con-line\u201d manner; that is, by making adjustments to each new\nexample without having to review all the previous training instances. While this problem\nonly involves simple bound constraints, we can use it to compare our method to existing\napproaches such as gradient projection. We start with some background behind the L1 penalty,\nmotivate the on-line learning approach, draw some experimental comparisons with existing\nmethods, then show that our algorithm can be used to filter spam.\nSuppose we have n training examples xi \u2261 (xi1, . . . , xim)^T paired with real-valued responses\nyi. (The notation here is separate from previous sections.) Assuming a linear model and\ncentred coordinates, the least squares estimate \u03b2 minimizes the mean squared error (MSE).\nLinear regression based on the maximum likelihood estimator is one of the basic statistical\ntools of science and engineering and, while primitive, generalizes to many popular statistical\nestimators, including linear discriminant analysis [9]. 
Because the least squares estimator is\nunstable when m is large, it can generalize poorly to unseen examples. The standard cure\nis \u201cregularization,\u201d which introduces bias, but typically produces estimators that are better\nat predicting the outputs of unseen examples. For instance, the MSE with an L1-penalty,\n\nMSE(L1) \u2261 (1/(2n)) \u2211_{i=1}^{n} (y_i \u2212 x_i^T \u03b2)^2 + (\u03bb/n) \u2016\u03b2\u2016_1,    (11)\n\nnot only prevents overfitting but tends to produce estimators that shrink many of the\ncomponents \u03b2j to zero, resulting in sparse codes. Here, \u2016\u00b7\u2016_1 is the L1 norm and \u03bb > 0\ncontrols for the level of regularization. This approach has been independently studied for\nmany problems, including statistical regression [22] and sparse signal reconstruction [3, 10],\nprecisely because it is effective at choosing useful features for prediction.\n\nWe can treat the gradient of MSE as a sample expectation over responses of the form\n\u2212x_i(y_i \u2212 x_i^T \u03b2), so the on-line or stochastic update\n\n\u03b2^(new) = \u03b2 + a x_i (y_i \u2212 x_i^T \u03b2),    (12)\n\nimproves the linear regression with only a single data point (a is the step size).^2 This is the\nfamed \u201cdelta rule\u201d of Widrow and Hoff [12]. Since standard \u201cbatch\u201d learning requires a full\npass through the data for each gradient evaluation, the on-line update (12) may be the only\nviable option when faced with, for instance, a collection of 80 million images [16]. On-line\nlearning for regression and classification\u2014including L2 regularization\u2014is a well-researched\ntopic, particularly for neural networks [17] and support vector machines (e.g. [19]). On-line\nlearning with L1 regularization, despite its ascribed benefits, has strangely avoided study.\n(The only known work that has approached the problem is [27] using subgradient methods.)\n\nWe derive an on-line, L1-regularized learning rule of the form\n\n\u03b2_pos^(new) = \u03b2_pos + a \u2206\u03b2_pos,    z_pos^(new) = z_pos + a(\u00b5/\u03b2_pos \u2212 z_pos \u2212 \u2206\u03b2_pos z_pos/\u03b2_pos)\n\u03b2_neg^(new) = \u03b2_neg + a \u2206\u03b2_neg,    z_neg^(new) = z_neg + a(\u00b5/\u03b2_neg \u2212 z_neg \u2212 \u2206\u03b2_neg z_neg/\u03b2_neg),\n\nsuch that\n\n\u2206\u03b2_pos = (x_i(y_i \u2212 x_i^T \u03b2) \u2212 \u03bb/n + \u00b5/\u03b2_pos)/(1 + z_pos/\u03b2_pos)\n\u2206\u03b2_neg = (\u2212x_i(y_i \u2212 x_i^T \u03b2) \u2212 \u03bb/n + \u00b5/\u03b2_neg)/(1 + z_neg/\u03b2_neg),    (13)\n\nand where \u00b5 > 0 is the barrier parameter, \u03b2 = \u03b2pos \u2212 \u03b2neg, zpos and zneg are the Lagrange\nmultipliers associated with the lower bounds \u03b2pos \u2265 0 and \u03b2neg \u2265 0, respectively, and a is a\nstep size ensuring the variables remain in the positive quadrant. Multiplication and division\nin (13) are component-wise. The remainder of the algorithm (Fig. 1) consists of choosing \u00b5\nand feasible step size a at each iteration. Let us briefly explain how we arrived at (13).\n\n^2 The gradient descent direction can be a poor choice because it ignores the scaling of the problem.\nMuch work has focused on improving the delta rule, but we shall not discuss these improvements.\n\nFigure 2: (left) Performance of constrained stochastic gradient methods for different step size\nsequences. (right) Performance of methods for increasing levels of variance in the dimensions\nof the training data. Note the logarithmic scale in the vertical axis.\n\nIt is difficult to find a collection of regression coefficients \u03b2 that directly minimizes MSE(L1)\nbecause the L1 norm is not differentiable at zero. 
The trick is to separate the coefficients\ninto their positive (\u03b2pos) and negative (\u03b2neg) components following [3], thereby transforming\nthe non-smooth, unconstrained optimization problem (11) into a smooth problem with\nconvex, quadratic objective and bound constraints \u03b2pos, \u03b2neg \u2265 0. The regularized delta\nrule (13) is then obtained from direct application of the primal-dual interior-point Newton\nsearch direction (6,7) with a stochastic gradient (see Eq. 12), and identity in place of H.\n\n5.1 Experiments\n\nWe ran four small experiments to assess the reliability and shrinkage effect of the interior-\npoint stochastic gradient method for linear regression with L1 regularization; refer to Fig. 1\nand Eq. 13.^3 We also studied four alternatives to our method: 1) a subgradient method,\n2) a smoothed, unconstrained approximation to (11), 3) a projected gradient method, and\n4) the augmented Lagrangian approach described in [24]. See [18] for an in-depth discussion\nof the merits of applying the first three optimization approaches to L1 regularization. All\nthese methods have a per-iteration cost on the order of the number of features.\n\nMethod. For the first three experiments, we simulated 20 data sets following the procedure\ndescribed in Sec. 7.5 of [22]. Each data set had n = 100 observations with m = 40 features.\nWe defined observations by xij = zij + zi, where zi was drawn from the standard normal\nand zij was drawn i.i.d. from the normal with variance \u03c3_j^2, which in turn was drawn from\nthe inverse Gamma with shape 2.5 and scale \u03bd = 1. (The mean of \u03c3_j^2 is proportional to \u03bd.)\nThe regression coefficients were \u03b2 = (0, . . . , 0, 2, . . . , 2, 0, . . . , 0, 2, . . . , 2)^T with 10 repeats in\neach block [22]. Outputs were generated according to yi = \u03b2^T xi + \u03f5 with standard Gaussian\nnoise \u03f5. 
Each method was executed with a single pass on the data (100 iterations) with\nstep sizes \u00e2k = 1/(k0 + k), where k0 = 50 by default. We chose L1 penalty \u03bb/n = 1.25,\nwhich tended to produce about 30% zero coefficients at the solution to (11). The augmented\nLagrangian required a sequence of penalty terms rk \u2192 0; after some trial and error, we chose\nrk = 50/(k0 + k)^0.1. The control variables of Experiments 1, 2 and 3 were, respectively, the\nstep size parameter k0, the inverse Gamma scale parameter \u03bd, and the L1 penalty parameter\n\u03bb. In Experiment 4, each training example xi had 8 features, and the\ntrue coefficients were set to \u03b2 = (0, 0, 2, \u22124, 0, 0, \u22121, 3)^T.\nResults. Fig. 2 shows the results of Experiments 1 and 2, with error (1/n) \u2016\u03b2exact \u2212 \u03b2on-line\u2016_1\naveraged over the 20 data sets, in which \u03b2exact is the solution to (11), and \u03b2on-line is the estimate\nobtained after 100 iterations of the on-line or stochastic gradient method. With a large\nenough step size, almost all the methods converged close to \u03b2exact. The stochastic interior-\npoint method, however, always came closest to \u03b2exact and, for the range of values we tried, its\nsolution was by far the least sensitive to the step size sequence and level of variance in the observations.\nExperiment 3 (Fig. 3) shows that even with well-chosen step sizes for all methods,\nthe stochastic interior-point method still best approximated the exact solution, and its performance\ndid not degrade when \u03bb was small. (The dashed vertical line at \u03bb/n = 1.25 in Fig. 3\ncorresponds to k0 = 50 and E(\u03c3^2) = 2/3 in the left and right plots of Fig. 2.) Fig. 4 shows the\nregularized estimates of Experiment 4. After one pass through the data (middle)\u2014equivalent to a\nsingle iteration of an exact solver\u2014the interior-point stochastic gradient method shrank some\nof the data components, but didn\u2019t quite discard irrelevant features altogether. After 10 visits\nto the training data (right), the stochastic algorithm exhibited feature selection close to what\nwe would normally expect from the Lasso (left).\n\nFigure 3: Performance of the methods for various choices of the L1 penalty.\n\nFigure 4: Shrinkage effect for different choices of the L1 penalty parameter.\n\n^3 The Matlab code for all our experiments is on the Web at http://www.cs.ubc.ca/~pcarbo.\n\n5.2 Filtering spam\n\nClassifying email as spam or not is most faithfully modeled as an on-line learning problem in\nwhich supervision is provided after each email has been designated for the inbox or trash [5].\nAn effective filter is one that minimizes misclassification of incoming messages\u2014throwing away\na good email being considerably more\ndeleterious than incorrectly placing a spam in the inbox. Without any prior knowledge as\nto what spam looks like, any filter will be error-prone at initial stages of deployment.\n\nSpam filtering necessarily involves lots of data and an even larger number of features, so\na sparse, stable model is essential. We adapted the L1-regularized delta rule to the spam\nfiltering problem by replacing the linear regression with a binary logistic regression [9]. The\non-line updates are similar to (13), only x_i^T \u03b2 is replaced by \u03c6(x_i^T \u03b2), with \u03c6(u) \u2261 1/(1 + e^{\u2212u}).\nTo our knowledge, no one has investigated this approach for on-line spam filtering, though\nthere is some work on logistic regression plus the Lasso for batch classification in text\ncorpora [8]. Needless to say, batch learning is completely impractical in this setting.\n\nMethod. We simulated the on-line spam filtering task on the trec2005 corpus [4] containing\nemails from the legal investigations of Enron corporation. We compared our on-line classifier\n(\u03bb = 10, \u03c3 = 1/2, \u00e2_i = 1/(1 + i)) with two open-source software packages, SpamBayes 1.0.3\nand Bogofilter 0.93.4. (These packages are publicly available at spambayes.sourceforge.net\nand bogofilter.sourceforge.net.) A full comparison is certainly beyond the scope of this paper;\nsee [5] for a comprehensive evaluation. We represented each email as a vector of normalized\nword frequencies, and used the word tokens extracted by SpamBayes. In the end, we had\nan on-line learning problem involving n = 92189 documents and m = 823470 features.\n\nResults for SpamBayes\n                  true not spam   true spam\npred. not spam            39382        3291\npred. spam                   17       49499\n\nResults for Bogofilter\n                  true not spam   true spam\npred. not spam            39393        5515\npred. spam                    3       47275\n\nResults for Logistic + L1\n                  true not spam   true spam\npred. not spam            39389        2803\npred. spam                   10       49987\n\nTable 1: Contingency tables for on-line spam filtering task on the trec2005 data set.\n\nResults. Following [5], we use contingency tables to present results of the on-line spam\nfiltering experiment (Table 1). The top-right/bottom-left entry of each table is the number\nof misclassified spam/non-spam. Everything was evaluated on-line. We tagged an email\nfor deletion only if p(yi = spam) \u2265 97%. Our spam filter dominated SpamBayes on the\ntrec2005 corpus, and performed comparably to Bogofilter\u2014one of the best spam filters to\ndate [5]. Our model\u2019s expense was slightly greater than the others. As we found in Sec. 
5.1,\nassessing sparsity of the on-line solution is more di\ufb03cult than in the exact case, but we can\nsay that removing the 41% smallest entries of \u03b2 resulted in almost no (< 0.001) change.\n\n6 Conclusions\nOur experiments on a learning problem with noisy gradient measurements and bound con-\nstraints show that the interior-point stochastic approximation algorithm is a signi\ufb01cant\nimprovement over other methods. The interior-point approach also has the virtue of being\nmuch more general, and our analysis guarantees that it will be numerically stable.\n\nAcknowledgements. Thanks to Ewout van den Berg, Matt Ho\ufb00man and Firas Hamze.\n\nReferences\n\n[1] L. Bottou and O. Bousquet, The tradeo\ufb00s of large scale learning, in Advances in Neural Infor-\n\nmation Processing Systems, vol. 20, 1998.\n\n[2] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University Press, 2004.\n[3] S. Chen, D. Donoho, and M. Saunders, Atomic decomposition by basis pursuit, SIAM Journal\n\non Scienti\ufb01c Computing, 20 (1999), pp. 33\u201361.\n\n[4] G. V. Cormack and T. R. Lynam, Spam corpus creation for TREC, in Proc. 2nd CEAS, 2005.\n, Online supervised spam \ufb01lter evaluation, ACM Trans. Information Systems, 25 (2007).\n[5]\n[6] A. V. Fiacco and G. P. McCormick, Nonlinear programming: sequential unconstrained mini-\n\nmization techniques, John Wiley and Sons, 1968.\n\n[7] A. Forsgren, P. E. Gill, and M. H. Wright, Interior methods for nonlinear optimization, SIAM\n\nReview, 44 (2002), pp. 525\u2013597.\n\n[8] A. Genkin, D. D. Lewis, and D. Madigan, Large-scale Bayesian logistic regression for text\n\ncategorization, Technometrics, 49 (2007), pp. 291\u2013304.\n\n[9] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning, Springer, 2001.\n[10] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, An interior-point method for\nlarge-scale L1-regularized least squares, IEEE J. 
Selected Topics in Signal Processing, 1 (2007).\n[11] H. J. Kushner and D. S. Clark, Stochastic approximation methods for constrained and unconstrained systems, Springer-Verlag, 1978.\n[12] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.\n[13] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, 2nd ed., 2006.\n[14] B. T. Poljak, Nonlinear programming methods in the presence of noise, Mathematical Programming, 14 (1978), pp. 87\u201397.\n[15] H. Robbins and S. Monro, A stochastic approximation method, Annals Math. Stats., 22 (1951).\n[16] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, LabelMe: a database and web-based tool for image annotation, Intl. Journal of Computer Vision, 77 (2008), pp. 157\u2013173.\n[17] D. Saad, ed., On-line learning in neural networks, Cambridge University Press, 1998.\n[18] M. Schmidt, G. Fung, and R. Rosales, Fast optimization methods for L1 regularization, in Proceedings of the 18th European Conference on Machine Learning, 2007, pp. 286\u2013297.\n[19] S. Shalev-Shwartz, Y. Singer, and N. Srebro, Pegasos: primal estimated sub-gradient solver for SVM, in Proceedings of the 24th Intl. Conference on Machine Learning, 2007, pp. 807\u2013814.\n[20] J. C. Spall, Adaptive stochastic approximation by the simultaneous perturbation method, IEEE Transactions on Automatic Control, 45 (2000), pp. 1839\u20131853.\n[21] J. C. Spall and J. A. Cristion, Model-free control of nonlinear stochastic systems with discrete-time measurements, IEEE Transactions on Automatic Control, 43 (1998), pp. 1148\u20131210.\n[22] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society, 58 (1996), pp. 267\u2013288.\n[23] L. N. Trefethen and D. Bau, Numerical linear algebra, SIAM, 1997.\n[24] I. Wang and J. C. Spall, Stochastic optimization with inequality constraints using simultaneous perturbations and penalty functions, in Proc. 42nd IEEE Conf. 
Decision and Control, 2003.\n[25] M. H. Wright, Some properties of the Hessian of the logarithmic barrier function, Mathematical Programming, 67 (1994), pp. 265\u2013295.\n[26] M. H. Wright, Ill-conditioning and computational error in interior methods for nonlinear programming, SIAM Journal on Optimization, 9 (1998), pp. 84\u2013111.\n[27] A. Zheng, Statistical software debugging, PhD thesis, University of California, Berkeley, 2005.\n", "award": [], "sourceid": 4, "authors": [{"given_name": "Peter", "family_name": "Carbonetto", "institution": null}, {"given_name": "Mark", "family_name": "Schmidt", "institution": null}, {"given_name": "Nando", "family_name": "Freitas", "institution": null}]}