{"title": "Proximal Quasi-Newton for Computationally Intensive L1-regularized M-estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 2375, "page_last": 2383, "abstract": "We consider the class of optimization problems arising from computationally intensive L1-regularized M-estimators, where the function or gradient values are very expensive to compute. A particular instance of interest is the L1-regularized MLE for learning Conditional Random Fields (CRFs), which are a popular class of statistical models for varied structured prediction problems such as sequence labeling, alignment, and classification with label taxonomy. L1-regularized MLEs for CRFs are particularly expensive to optimize since computing the gradient values requires an expensive inference step. In this work, we propose the use of a carefully constructed proximal quasi-Newton algorithm for such computationally intensive M-estimation problems, where we employ an aggressive active set selection technique. In a key contribution of the paper, we show that our proximal quasi-Newton algorithm is provably super-linearly convergent, even in the absence of strong convexity, by leveraging a restricted variant of strong convexity. In our experiments, the proposed algorithm converges considerably faster than current state-of-the-art on the problems of sequence labeling and hierarchical classification.", "full_text": "Proximal Quasi-Newton for Computationally\n\nIntensive (cid:96)1-regularized M-estimators\n\nKai Zhong 1\n\nIan E.H. Yen 2\n\nInderjit S. 
Dhillon 2\n\n1 Institute for Computational Engineering & Sciences\n\nPradeep Ravikumar 2\n2 Department of Computer Science\n\nzhongkai@ices.utexas.edu, {ianyen,inderjit,pradeepr}@cs.utexas.edu\n\nUniversity of Texas at Austin\n\nAbstract\n\nWe consider the class of optimization problems arising from computationally in-\ntensive (cid:96)1-regularized M-estimators, where the function or gradient values are\nvery expensive to compute. A particular instance of interest is the (cid:96)1-regularized\nMLE for learning Conditional Random Fields (CRFs), which are a popular class\nof statistical models for varied structured prediction problems such as sequence\nlabeling, alignment, and classi\ufb01cation with label taxonomy. (cid:96)1-regularized MLEs\nfor CRFs are particularly expensive to optimize since computing the gradient val-\nues requires an expensive inference step. In this work, we propose the use of a\ncarefully constructed proximal quasi-Newton algorithm for such computationally\nintensive M-estimation problems, where we employ an aggressive active set se-\nlection technique. In a key contribution of the paper, we show that the proximal\nquasi-Newton method is provably super-linearly convergent, even in the absence\nof strong convexity, by leveraging a restricted variant of strong convexity. In our\nexperiments, the proposed algorithm converges considerably faster than current\nstate-of-the-art on the problems of sequence labeling and hierarchical classi\ufb01ca-\ntion.\n\nIntroduction\n\n1\n(cid:96)1-regularized M-estimators have attracted considerable interest in recent years due to their ability\nto \ufb01t large-scale statistical models, where the underlying model parameters are sparse. The opti-\nmization problem underlying these (cid:96)1-regularized M-estimators takes the form:\n\nf (w) := \u03bb(cid:107)w(cid:107)1 + (cid:96)(w),\n\nmin\nw\n\n(1)\n\nwhere (cid:96)(w) is a convex differentiable loss function. 
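To make the composite objective (1) concrete, here is a minimal sketch with a logistic loss standing in for the smooth term ℓ(w); the data, loss choice and λ are purely illustrative (the paper's actual ℓ is the CRF negative log-likelihood of Section 4):

```python
import numpy as np

def l1_objective(w, X, y, lam):
    """f(w) = lam * ||w||_1 + l(w), with logistic loss as an example smooth l."""
    margins = -y * (X @ w)                     # labels y in {-1, +1}
    loss = np.sum(np.logaddexp(0.0, margins))  # sum_i log(1 + exp(-y_i x_i^T w))
    return lam * np.sum(np.abs(w)) + loss

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
print(l1_objective(np.zeros(2), X, y, lam=0.1))  # 2*log(2) ≈ 1.3863 at w = 0
```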
In this paper, we are particularly interested in the case where the function or gradient values are very expensive to compute; we refer to these functions as computationally intensive functions, or CI functions for short. A particular case of interest is ℓ1-regularized MLEs for Conditional Random Fields (CRFs), where computing the gradient requires an expensive inference step.\n\nThere has been a line of recent work on computationally efficient methods for solving (1), including [2, 8, 13, 21, 23, 4]. It has now become well understood that it is key to leverage the sparsity of the optimal solution by maintaining sparse intermediate iterates [2, 5, 8]. Coordinate Descent (CD) based methods, like CDN [8], maintain the sparsity of intermediate iterates by focusing on an active set of working variables. A caveat with such methods is that, for CI functions, each coordinate update typically requires a call to the inference oracle to evaluate the partial derivative for a single coordinate. One approach, adopted in [16], is to use Blockwise Coordinate Descent, which updates a block of variables at a time while ignoring second-order effects; this, however, sacrifices the convergence guarantee. Newton-type methods have also attracted a surge of interest in recent years [5, 13], but these require computing the exact Hessian or Hessian-vector products, which is very expensive for CI functions. This then suggests the use of quasi-Newton methods, popular instances of which include OWL-QN [23], which is adapted from ℓ2-regularized L-BFGS, as well as Projected Quasi-Newton (PQN) [4]. A key caveat with OWL-QN and PQN, however, is that they do not exploit the sparsity of the underlying solution. In this paper, we consider the class of Proximal Quasi-Newton (Prox-QN) methods, which we argue are particularly well-suited to such CI functions, for the following three reasons. First, Prox-QN requires gradient evaluations only once per outer iteration. Second, it is a second-order method with asymptotic superlinear convergence. Third, it can employ an active-set strategy to reduce the time complexity from O(d) to O(nnz), where d is the number of parameters and nnz is the number of non-zero parameters.\n\nWhile there has been some recent work on Prox-QN algorithms [2, 3], we carefully construct an implementation that is particularly suited to CI ℓ1-regularized M-estimators. We carefully maintain the sparsity of intermediate iterates, and at the same time reduce the gradient evaluation time. A key facet of our approach is an aggressive active set selection (which we also term a “shrinking strategy”) to reduce the number of active variables under consideration at any iteration, and correspondingly the number of partial-gradient evaluations in each iteration. Our strategy is particularly aggressive in that it runs over multiple epochs; within each epoch, it chooses the next working set as a subset of the current working set rather than the whole set, while at the end of an epoch it allows other variables to come back in. As a result, in most iterations, our aggressive shrinking strategy only requires the evaluation of partial gradients on the current working set. Moreover, we adapt the L-BFGS update to the shrinking procedure so that the update can be conducted without any loss of accuracy caused by aggressive shrinking. Finally, we store our data in a feature-indexed structure to exploit data sparsity as well as iterate sparsity.\n\n[26] showed global convergence and asymptotic superlinear convergence for Prox-QN methods under the assumption that the loss function is strongly convex.
However, this assumption is known to fail to hold in high-dimensional sampling settings, where the Hessian is typically rank-deficient, or indeed even in low-dimensional settings where there are redundant features. In a key contribution of the paper, we provide provable guarantees of asymptotic superlinear convergence for the Prox-QN method, even without assuming strong convexity, but under a restricted variant of strong convexity, termed Constant Nullspace Strong Convexity (CNSC), which is typically satisfied by standard M-estimators.\n\nTo summarize, our contributions are twofold. (a) We present a carefully constructed proximal quasi-Newton method for computationally intensive (CI) ℓ1-regularized M-estimators, which we empirically show to outperform many state-of-the-art methods on CRF problems. (b) We provide the first proof of asymptotic superlinear convergence for Prox-QN methods without strong convexity, but under a restricted variant of strong convexity satisfied by typical M-estimators, including the ℓ1-regularized CRF MLEs.\n\n2 Proximal Quasi-Newton Method\n\nA proximal quasi-Newton approach to solving M-estimators of the form (1) proceeds by iteratively constructing a quadratic approximation of the objective function (1) to find the quasi-Newton direction, and then conducting a line search procedure to obtain the next iterate.\n\nGiven a solution estimate w_t at iteration t, the proximal quasi-Newton method computes a descent direction by minimizing the following regularized quadratic model,\n\nd_t = argmin_Δ  g_t^T Δ + (1/2) Δ^T B_t Δ + λ‖w_t + Δ‖_1,   (2)\n\nwhere g_t = g(w_t) is the gradient of ℓ(w_t) and B_t is an approximation to the Hessian of ℓ(w). B_t is usually formulated by the L-BFGS algorithm.
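When B_t is simplified to a scaled identity γI (no curvature pairs), the model (2) separates across coordinates and has a closed-form soft-threshold minimizer; this degenerate case, sketched below with illustrative values, is exactly a proximal-gradient step, whereas the paper uses the full L-BFGS B_t:

```python
import numpy as np

def soft_threshold(x, tau):
    """S(x, tau) = sign(x) * max(|x| - tau, 0), applied element-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def direction_scaled_identity(w, g, gamma, lam):
    """argmin_D g^T D + (gamma/2)||D||^2 + lam*||w + D||_1, for B_t = gamma*I."""
    return soft_threshold(w - g / gamma, lam / gamma) - w

w = np.array([0.5, 0.0])
g = np.array([0.2, -0.05])
d = direction_scaled_identity(w, g, gamma=1.0, lam=0.1)  # array([-0.3, 0.])
```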
This subproblem (2) can be efficiently solved by a randomized coordinate descent algorithm, as shown in Section 2.2. The next iterate is obtained from a backtracking line search procedure, w_{t+1} = w_t + α_t d_t, where the step size α_t is tried over {β^0, β^1, β^2, ...} until the Armijo rule is satisfied,\n\nf(w_t + α_t d_t) ≤ f(w_t) + α_t σ Δ_t,\n\nwhere 0 < β < 1, 0 < σ < 1 and Δ_t = g_t^T d_t + λ(‖w_t + d_t‖_1 − ‖w_t‖_1).\n\n2.1 BFGS update formula\n\nB_t can be efficiently updated from the gradients of the previous iterations according to the BFGS update [18],\n\nB_t = B_{t−1} − (B_{t−1} s_{t−1} s_{t−1}^T B_{t−1}) / (s_{t−1}^T B_{t−1} s_{t−1}) + (y_{t−1} y_{t−1}^T) / (y_{t−1}^T s_{t−1}),   (3)\n\nwhere s_t = w_{t+1} − w_t and y_t = g_{t+1} − g_t. We use the compact formula for B_t [18],\n\nB_t = B_0 − Q R Q^T = B_0 − Q Q̂,\n\nwhere\n\nQ := [B_0 S_t, Y_t],  R := [[S_t^T B_0 S_t, L_t]; [L_t^T, −D_t]]^{−1},  Q̂ := R Q^T,\n\nS_t = [s_0, s_1, ..., s_{t−1}],  Y_t = [y_0, y_1, ..., y_{t−1}],\n\nD_t = diag[s_0^T y_0, ..., s_{t−1}^T y_{t−1}],  and (L_t)_{i,j} = s_{i−1}^T y_{j−1} if i > j, and 0 otherwise.\n\nIn our practical implementation, we apply limited-memory BFGS, which uses only the information of the most recent m gradients, so that Q and Q̂ have size d × 2m and 2m × d, respectively. B_0 is usually set to γ_t I when computing B_t, where γ_t = y_{t−1}^T s_{t−1} / s_{t−1}^T s_{t−1} [18]. As will be discussed in Section 2.3, Q (Q̂) is updated only on the rows (columns) corresponding to the working set A. The time complexity of the L-BFGS update is O(m^2|A| + m^3).\n\n2.2 Coordinate Descent for the Inner Problem\n\nRandomized coordinate descent is carefully employed to solve the inner problem (2) by Tang and Scheinberg [2].
In the update for coordinate j, d ← d + z*e_j, where z* is obtained by solving the one-dimensional problem\n\nz* = argmin_z  (1/2)(B_t)_{jj} z^2 + ((g_t)_j + (B_t d)_j) z + λ|(w_t)_j + d_j + z|.\n\nThis one-dimensional problem has the closed-form solution z* = −c + S(c − b/a, λ/a), where S is the soft-threshold function and a = (B_t)_{jj}, b = (g_t)_j + (B_t d)_j and c = (w_t)_j + d_j. For B_0 = γ_t I, the diagonal of B_t can be computed as (B_t)_{jj} = γ_t − q_j^T q̂_j, where q_j^T is the j-th row of Q and q̂_j is the j-th column of Q̂. The second term in b, (B_t d)_j, can be computed as\n\n(B_t d)_j = γ_t d_j − q_j^T Q̂ d = γ_t d_j − q_j^T d̂,\n\nwhere d̂ := Q̂ d. Since d̂ has only 2m dimensions, it is fast to update (B_t d)_j from q_j and d̂. In each inner iteration only d_j is updated, so we have the fast update d̂ ← d̂ + q̂_j z*.\n\nSince we only update the coordinates in the working set, the above algorithm has computational complexity O(m|A| × inner_iter), where inner_iter is the number of iterations used for solving the inner problem.\n\n2.3 Implementation\n\nIn this section, we discuss several key implementation details used in our algorithm to speed up the optimization.\n\nShrinking Strategy\n\nIn each iteration, we select an active or working subset A of the set of all variables: only the variables in this set are updated in the current iteration. The complementary set, also called the fixed set, contains only zero values and is not updated. The use of such a shrinking strategy reduces the overall complexity from O(d) to O(|A|).
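The closed-form coordinate update of Section 2.2 can be sketched as follows; q_j, q̂_j and d̂ are the compact L-BFGS quantities defined above, and the zero Q, Q̂ used in the demo (so that B_t = γI) are illustrative:

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def coordinate_update(j, w, g, d, d_hat, Q, Q_hat, gamma, lam):
    """One randomized-CD step on model (2) using the compact L-BFGS form:
    a = (B_t)_jj, b = (g_t)_j + (B_t d)_j, c = (w_t)_j + d_j,
    z* = -c + S(c - b/a, lam/a); updates d and d_hat = Q_hat @ d in place."""
    q_j = Q[j]                      # j-th row of Q, length 2m
    qhat_j = Q_hat[:, j]            # j-th column of Q_hat
    a = gamma - q_j @ qhat_j        # (B_t)_jj
    b = g[j] + gamma * d[j] - q_j @ d_hat
    c = w[j] + d[j]
    z = -c + soft_threshold(c - b / a, lam / a)
    d[j] += z
    d_hat += qhat_j * z
    return z

# Demo with zero curvature pairs, i.e. B_t = gamma * I (illustrative values).
m = 1
w, g, d = np.array([0.5]), np.array([0.2]), np.zeros(1)
d_hat, Q, Q_hat = np.zeros(2 * m), np.zeros((1, 2 * m)), np.zeros((2 * m, 1))
z = coordinate_update(0, w, g, d, d_hat, Q, Q_hat, gamma=1.0, lam=0.1)  # -0.3
```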
Specifically, we (a) update the gradients only on the working set, (b) update Q (Q̂) only on the rows (columns) corresponding to the working set, and (c) compute the latest entries of D_t, γ_t, L_t and S_t^T S_t using only the corresponding working set rather than the whole set.\n\nThe key facet of our “shrinking strategy”, however, is in aggressively shrinking the active set: at the next iteration, we set the active set to be a subset of the previous active set, so that A_t ⊂ A_{t−1}. Such an aggressive shrinking strategy, however, is not guaranteed to weed out only irrelevant variables. Accordingly, we proceed in epochs. In each epoch, we progressively shrink the active set as above, until the iterates seem to converge. At that point, we allow all the “shrunk” variables to come back and start a new epoch. Such a strategy was also called an ε-cooling strategy by Fan et al. [14], where the shrinking stopping criterion is loose at the beginning, and progressively becomes more strict each time all the variables are brought back. For the L-BFGS update, when a new epoch starts, the memory of L-BFGS is cleared to prevent any loss of accuracy.\n\nBecause at the first iteration of each new epoch the entire gradient over all coordinates is evaluated, the computation time for those iterations accounts for a significant portion of the total time complexity. Fortunately, our experiments show that the number of epochs is typically between 3 and 5.\n\nInexact inner problem solution\n\nLike many other proximal methods, e.g. GLMNET and QUIC, we solve the inner problem inexactly. This reduces the time complexity of the inner problem dramatically. The amount of inexactness is based on a heuristic that aims to balance the computation time of the inner problem in each outer iteration. The computation time of the inner problem is determined by the number of inner iterations and the size of the working set. Thus, we let the number of inner iterations be inner_iter = min{max_inner, ⌊d/|A|⌋}, where max_inner = 10 in our experiments.\n\nData Structure for both model sparsity and data sparsity\n\nIn our implementation we take two sparsity patterns into consideration: (a) model sparsity, which accounts for the fact that most parameters are equal to zero in the optimal solution; and (b) data sparsity, wherein most feature values of any particular instance are zero. We use a feature-indexed data structure to take advantage of both sparsity patterns. Computations involving data would be time-consuming if we computed over all instances, including those that are zero. So we leverage the sparsity of the data by using vectors of pairs, whose members are an index and its value. Traditionally, each vector represents an instance and the indices in its pairs are feature indices. In our implementation, however, to take both model sparsity and data sparsity into account, we use an inverted data structure, where each vector represents one feature (feature-indexed) and the indices in its pairs are instance indices. This data structure facilitates the computation of the gradient for a particular feature, which involves only the instances related to that feature.\n\nWe summarize these steps in the algorithm below.
And a detailed algorithm is given in Appendix 2.\n\nAlgorithm 1 Proximal Quasi-Newton Algorithm (Prox-QN)\nInput: dataset {x(i), y(i)}_{i=1,2,...,N}, termination criterion ε, λ and L-BFGS memory size m.\nOutput: w* converging to argmin_w f(w).\n1: Initialize w ← 0, g ← ∂ℓ(w)/∂w, working set A ← {1, 2, ..., d}, and S, Y, Q, Q̂ ← ∅.\n2: while termination criterion is not satisfied or working set does not contain all the variables do\n3:  Shrink working set.\n4:  if shrinking stopping criterion is satisfied then\n5:   Take all the shrunken variables back into the working set and clear the memory of L-BFGS.\n6:   Update the shrinking stopping criterion and continue.\n7:  end if\n8:  Solve the inner problem (2) over the working set and obtain the new direction d.\n9:  Conduct line search based on the Armijo rule and obtain the new iterate w.\n10:  Update g, s, y, S, Y, Q, Q̂ and related matrices over the working set.\n11: end while\n\n3 Convergence Analysis\n\nIn this section, we analyze the convergence behavior of the proximal quasi-Newton method in the superlinear convergence phase, where the unit step size is chosen. To simplify the analysis, we assume in this section that the inner problem is solved exactly and no shrinking strategy is employed. We also provide a global convergence proof for the Prox-QN method with shrinking strategy in Appendix 1.5. In the current literature, the analysis of proximal Newton-type methods relies on the assumption of a strongly convex objective function to prove superlinear convergence [3]; otherwise, only a sublinear rate can be proved [25]. However, our objective (1) is not strongly convex when the dimension is very large or there are redundant features. In particular, the Hessian matrix H(w) of the smooth function ℓ(w) is not positive-definite.
We thus leverage a recently introduced restricted variant of strong convexity, termed Constant Nullspace Strong Convexity (CNSC) in [1], where the authors analyzed the behavior of proximal gradient and proximal Newton methods under such a condition. The proximal quasi-Newton procedure in this paper, however, requires a subtler analysis; in a key contribution of the paper, we are nonetheless able to show asymptotic superlinear convergence of the Prox-QN method under this restricted variant of strong convexity.\n\nDefinition 1 (Constant Nullspace Strong Convexity (CNSC)). A composite function (1) is said to have Constant Nullspace Strong Convexity restricted to a space T (CNSC-T) if there is a constant vector space T s.t. ℓ(w) depends only on proj_T(w), i.e. ℓ(w) = ℓ(proj_T(w)), and its Hessian satisfies\n\nm‖v‖^2 ≤ v^T H(w) v ≤ M‖v‖^2,  ∀v ∈ T, ∀w ∈ R^d,   (4)\n\nfor some M ≥ m > 0, and\n\nH(w) v = 0,  ∀v ∈ T⊥, ∀w ∈ R^d,   (5)\n\nwhere proj_T(w) is the projection of w onto T and T⊥ is the complementary space orthogonal to T.\n\nThis condition can be seen to be an algebraic condition that is satisfied by typical M-estimators considered in high-dimensional settings. In this paper, we will abuse the use of CNSC-T for symmetric matrices: we say a symmetric matrix H satisfies the CNSC-T condition if H satisfies (4) and (5). In the following theorems, we denote the orthogonal basis of T by U ∈ R^{d×d̂}, where d̂ ≤ d is the dimensionality of the space T and U^T U = I. The projection onto the space T can then be written as proj_T(w) = U U^T w.\n\nTheorem 1 (Asymptotic Superlinear Convergence). Assume ∇²ℓ(w) and ∇ℓ(w) are Lipschitz continuous. Let B_t be the matrices generated by the BFGS update (3).
Then if ℓ(w) and B_t satisfy the CNSC-T condition, the proximal quasi-Newton method has q-superlinear convergence:\n\n‖z_{t+1} − z*‖ ≤ o(‖z_t − z*‖),\n\nwhere z_t = U^T w_t, z* = U^T w* and w* is an optimal solution of (1).\n\nThe proof is given in Appendix 1.4. We prove it by exploiting the CNSC-T property. First, we rebuild our problem and algorithm on the reduced space Z = {z ∈ R^{d̂} | z = U^T w}, where the strong-convexity property holds. Then we prove the asymptotic superlinear convergence on Z following Theorem 3.7 in [26].\n\nTheorem 2. For Lipschitz continuous ℓ(w), the sequence {w_t} produced by the proximal quasi-Newton method in the superlinear convergence phase satisfies\n\nf(w_t) − f(w*) ≤ L‖z_t − z*‖,   (6)\n\nwhere L = L_ℓ + λ√d, L_ℓ is the Lipschitz constant of ℓ(w), z_t = U^T w_t and z* = U^T w*.\n\nThe proof is also in Appendix 1.4. It proceeds by showing that both the smooth part and the non-differentiable part satisfy a modified Lipschitz continuity.\n\n4 Application to Conditional Random Fields with ℓ1 Penalty\n\nIn CRF problems, we are interested in learning a conditional distribution of labels y ∈ Y given observations x ∈ X, where y has application-dependent structure, such as a sequence, tree, or table, in which label assignments have inter-dependency. The distribution is of the form\n\nP_w(y|x) = (1/Z_w(x)) exp{ Σ_{k=1}^{d} w_k f_k(y, x) },\n\nwhere the f_k are feature functions, w_k is the associated weight, d is the number of feature functions and Z_w(x) is the partition function.
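For intuition, P_w(y|x) can be computed for a tiny linear-chain CRF by brute-force enumeration of Y; this stands in for the forward-backward oracle used in the paper (it is exponential in the sequence length), and the feature/weight values are illustrative:

```python
import itertools
import math

def crf_prob(y, x, w_uni, w_bi, labels):
    """P_w(y|x) for a linear-chain CRF via explicit enumeration of all
    label sequences (for illustration only; exponential in len(x))."""
    def score(seq):
        s = sum(w_uni[seq[t]][x[t]] for t in range(len(x)))
        s += sum(w_bi[seq[t]][seq[t + 1]] for t in range(len(x) - 1))
        return s
    log_Z = math.log(sum(math.exp(score(seq))
                         for seq in itertools.product(labels, repeat=len(x))))
    return math.exp(score(y) - log_Z)

labels = ("A", "B")
w_uni = {"A": {0: 1.0, 1: 0.0}, "B": {0: 0.0, 1: 1.0}}
w_bi = {"A": {"A": 0.5, "B": 0.0}, "B": {"A": 0.0, "B": 0.5}}
p = crf_prob(("A", "B"), x=(0, 1), w_uni=w_uni, w_bi=w_bi, labels=labels)
```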
Given a training data set {(x(i), y(i))}_{i=1}^{N}, our goal is to find the optimal weights w such that the following ℓ1-regularized negative log-likelihood is minimized:\n\nmin_w f(w) = λ‖w‖_1 − Σ_{i=1}^{N} log P_w(y(i)|x(i)).   (7)\n\nSince |Y|, the number of possible values y takes, can be exponentially large, the evaluation of ℓ(w) and the gradient ∇ℓ(w) needs application-dependent oracles to conduct the summation over Y. For example, in the sequence labeling problem, a dynamic programming oracle, the forward-backward algorithm, is usually employed to compute ∇ℓ(w). Such an oracle can be very expensive. In the Prox-QN algorithm for the sequence labeling problem, the forward-backward algorithm takes O(|Y|^2 N T × exp) time, where exp is the time for the expensive exponential computation, T is the sequence length and Y is the set of possible labels for a symbol in the sequence. Given the obtained oracle, the evaluation of the partial gradients over the working set A then has time complexity O(D_nnz |A| T), where D_nnz is the average number of instances related to a feature. Thus, when O(|Y|^2 N T × exp + D_nnz |A| T) > O(m^3 + m^2 |A|), the gradient evaluation time dominates.\n\nThe following theorem establishes that ℓ1-regularized CRF MLEs satisfy the CNSC-T condition.\n\nTheorem 3.
With the ℓ1 penalty, the CRF loss function ℓ(w) = −Σ_{i=1}^{N} log P_w(y(i)|x(i)) satisfies the CNSC-T condition with T = N⊥, where N = {v ∈ R^d | Φ^T v = 0} is a constant subspace of R^d and Φ ∈ R^{d×(N|Y|)} is defined as\n\nΦ_{kn} = f_k(y_l, x(i)) − E[f_k(y, x(i))],\n\nwhere n = (i−1)|Y| + l, l = 1, 2, ..., |Y|, and E is the expectation over the conditional probability P_w(y|x(i)).\n\nAccording to the definition of the CNSC-T condition, ℓ1-regularized CRF MLEs do not satisfy the classical strong-convexity condition when N has non-zero members, which happens in the following two cases: (i) the exponential representation is not minimal [27], i.e. for any instance i there exist a non-zero vector a and a constant b_i such that ⟨a, φ(y, x(i))⟩ = b_i, where φ(y, x(i)) = [f_1(y, x(i)), f_2(y, x(i)), ..., f_d(y, x(i))]^T; (ii) d > N|Y|, i.e., the number of feature functions is very large. The first case holds in many problems, like the sequence labeling and hierarchical classification problems discussed in Section 6, and the second case holds in high-dimensional problems.\n\n5 Related Methods\n\nSeveral methods have been proposed for solving ℓ1-regularized M-estimators of the form (7). In this section, we discuss these in relation to our method.\n\nOrthant-Wise Limited-memory Quasi-Newton (OWL-QN), introduced by Andrew and Gao [23], extends L-BFGS to ℓ1-regularized problems. In each iteration, OWL-QN computes a generalized gradient called the pseudo-gradient to determine the orthant and the search direction, then does a line search and projects the new iterate back onto the orthant. Due to its fast convergence, it is widely implemented by many software packages, such as CRF++, CRFsuite and Wapiti.
But OWL-QN does not take advantage of model sparsity in the optimization procedure, and moreover Yu et al. [22] have raised issues with its convergence proof.\n\nStochastic Gradient Descent (SGD) uses the gradient of a single sample as the search direction at each iteration. The computation for each iteration is thus very fast, which leads to fast convergence at the beginning. However, the convergence becomes slower than that of second-order methods when the iterate is close to the optimal solution. Recently, an ℓ1-regularized SGD algorithm proposed by Tsuruoka et al. [21] was claimed to converge faster than OWL-QN. It incorporates ℓ1-regularization by using a cumulative ℓ1 penalty, which is close to the ℓ1 penalty the parameter would have received had it been updated by the true gradient. Tsuruoka et al. do consider data sparsity, i.e. for each instance, only the parameters related to the current instance are updated, but they too do not take model sparsity into account.\n\nCoordinate Descent (CD) and Blockwise Coordinate Descent (BCD) are popular methods for ℓ1-regularized problems. Each coordinate descent iteration solves a one-dimensional quadratic approximation of the objective function, which has a closed-form solution. This requires the second partial derivative with respect to the coordinate, but as discussed by Sokolovska et al., the exact second derivative in CRF problems is intractable. They instead use an approximation of the second derivative, which can be computed efficiently by the same inference oracle queried for the gradient evaluation. However, pure CD is very expensive because it requires calling the inference oracle for the instances related to the current coordinate in each coordinate update. BCD alleviates this problem by grouping the parameters with the same x feature into a block. Then each block update only needs to call the inference oracle once for the instances related to the current x feature. However, it cannot alleviate the large number of inference oracle calls unless the data is so sparse that every instance appears in only a few blocks.\n\nThe proximal Newton method has proven successful on problems such as ℓ1-regularized logistic regression [13] and sparse inverse covariance estimation [5], where the Hessian-vector product can be cheaply re-evaluated for each coordinate update. However, the Hessian-vector product for a CI function like the CRF loss requires querying the inference oracle no matter how many coordinates are updated at a time [17], which makes a coordinate update on the quadratic approximation as expensive as a coordinate update on the original problem. Our proximal quasi-Newton method avoids this problem by replacing the Hessian with a low-rank matrix from the BFGS update.\n\n6 Numerical Experiments\n\nWe compare our approach, Prox-QN, with four other methods: Proximal Gradient (Prox-GD), OWL-QN [23], SGD [21] and BCD [16]. For OWL-QN, we directly use the OWL-QN optimizer developed by Andrew et al.¹, where we set the memory size to m = 10, the same as in Prox-QN. For SGD, we implement the algorithm proposed by Tsuruoka et al. [21], and use the cumulative ℓ1 penalty with learning rate η_k = η_0/(1 + k/N), where k is the SGD iteration and N is the number of samples. For BCD, we follow Sokolovska et al. [16] but with three modifications. First, we add a line search procedure in each block update, since we found it is required for convergence. Second, we apply the shrinking strategy discussed in Section 2.3.
Third, when the second derivative for some coordinate is less than 10^{−10}, we set it to 10^{−10}, because otherwise the lack of ℓ2-regularization in our problem setting can lead to a very large new iterate.\n\nWe evaluate the performance of the Prox-QN method on two problems, sequence labeling and hierarchical classification. In particular, we plot the relative objective difference (f(w_t) − f(w*))/f(w*) and the number of non-zero parameters (on a log scale) against time in seconds. More experimental results, for example the testing accuracy and the performance for different λ's, are given in Appendix 5. All experiments are executed on a 2.8GHz Intel Xeon E5-2680 v2 Ivy Bridge processor with 1/4TB memory and Linux OS.\n\n6.1 Sequence Labeling\n\nIn sequence labeling problems, each instance (x, y) = {(x_t, y_t)}_{t=1,2,...,T} is a sequence of T pairs of observations and corresponding labels. Here we consider the optical character recognition (OCR) problem, which aims to recognize handwritten words. The dataset² was preprocessed by Taskar et al. [19] and was originally collected by Kassel [20]; it contains 6,877 words (instances). We randomly divide the dataset into two parts: a training part with 6,216 words and a testing part with 661 words. The character label set Y consists of the 26 English letters, and the observations are characters represented by images of 16 by 8 binary pixels, as shown in Figure 1(a). We use degree-2 pixels as the raw features, which means all pixel pairs are considered. Therefore, the number of raw features is J = 128 × 127/2 + 128 + 1, including a bias. For degree-2 features, x_{tj} = 1 only when both pixels are 1, and otherwise x_{tj} = 0, where x_{tj} is the j-th raw feature of x_t.
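The degree-2 raw features just described (all pixel pairs via AND, the single pixels, and a bias) can be constructed as below; the input layout is illustrative:

```python
import numpy as np
from itertools import combinations

def degree2_features(pixels):
    """Binary pixel vector (length 128) -> degree-2 raw features:
    all pixel pairs (1 only when both pixels are 1), the pixels, and a bias.
    Length: 128*127/2 + 128 + 1 = 8257."""
    pairs = [pixels[i] & pixels[j] for i, j in combinations(range(len(pixels)), 2)]
    return np.array(pairs + list(pixels) + [1], dtype=np.uint8)

x = degree2_features([1, 0] * 64)  # x.shape == (8257,)
```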
For the feature functions, we use unigram feature functions 1(y_t = y, x_{tj} = 1) and bigram feature functions 1(y_t = y, y_{t+1} = y′), with associated weights Θ_{y,j} and Λ_{y,y′}, respectively. So w = {Θ, Λ} for Θ ∈ R^{|Y|×J} and Λ ∈ R^{|Y|×|Y|}, and the total number of parameters is d = |Y|^2 + |Y| × J = 215,358. Using the above feature functions, the potential function can be specified as\n\nP̃_w(y, x) = exp{ ⟨Λ, Σ_{t=1}^{T−1} e_{y_t} e_{y_{t+1}}^T⟩ + ⟨Θ, Σ_{t=1}^{T} e_{y_t} x_t^T⟩ },\n\nwhere ⟨·,·⟩ is the sum of the element-wise product and e_y ∈ R^{|Y|} is a unit vector with 1 at the y-th entry and 0 at the other entries. The gradient and the inference oracle are given in Appendix 4.1.\n\nIn our experiment, λ is set to 100, which leads to a relatively high testing accuracy and an optimal solution with a relatively small number of non-zero parameters (see Appendix 5.2). The learning rate η_0 for SGD is tuned to 2 × 10^{−4} for best performance. In BCD, the unigram parameters are grouped into J blocks according to the x features, while the bigram parameters are grouped into one block. Our proximal quasi-Newton method can be seen to be much faster than the other methods.\n\n¹ http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/\n² http://www.seas.upenn.edu/~taskar/ocr/\n\nFigure 1: Sequence Labeling Problem. (a) Graphical model of OCR. (b) Relative Objective Difference. (c) Non-zero Parameters.\n\n6.2 Hierarchical Classification\n\nIn hierarchical classification problems, we have a label taxonomy, where the classes are grouped into a tree, as shown in Figure 2(a). Here y ∈ Y is one of the leaf nodes.
If we have K classes (tree nodes) in total and J raw features, then the number of parameters is d = K × J. Let W ∈ R^{K×J} denote the weights. The feature function corresponding to W_{k,j} is f_{k,j}(y, x) = 1[k ∈ Path(y)] x_j, where k ∈ Path(y) means class k is an ancestor of y or y itself. The potential function is

˜P_W(y, x) = exp{ Σ_{k∈Path(y)} w_k^T x },

where w_k is the weight vector of the k-th class, i.e., the k-th row of W. The gradient and the inference oracle are given in Appendix 4.2.

The dataset comes from Task 1 of the dry-run dataset of LSHTC1³. It has 4,463 samples, each with J = 51,033 raw features. The hierarchical tree has 2,388 classes, of which 1,139 are leaf labels. Thus, the number of parameters is d = 121,866,804. The feature values are scaled by the svm-scale program in the LIBSVM package. We set λ = 1 to achieve a relatively high testing accuracy and high sparsity of the optimal solution. The SGD initial learning rate is tuned to η0 = 10 for best performance. In BCD, parameters are grouped into J blocks according to the raw features.

[Figure 2: Hierarchical Classification Problem. (a) Label Taxonomy; (b) Relative Objective Difference; (c) Non-zero Parameters.]

As Figures 1(b), 1(c) and 2(b), 2(c) show, Prox-QN achieves much faster convergence and, moreover, obtains a sparse model in much less time.

Acknowledgement
This research was supported by NSF grants CCF-1320746 and CCF-1117055. P.R. acknowledges the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894, IIS-1447574, and DMS-1264033. K.Z.
acknowledges the support of the National Initiative for Modeling and Simulation fellowship.

3 http://lshtc.iit.demokritos.gr/node/1

[Plots in Figures 1(b), 1(c) and 2(b), 2(c) compare BCD, OWL-QN, Prox-GD, Prox-QN, and SGD on the two problems.]

References
[1] I. E.H. Yen, C.-J. Hsieh, P. Ravikumar, and I. S. Dhillon. Constant nullspace strong convexity and fast convergence of proximal methods under high-dimensional settings. In NIPS 2014.
[2] X. Tang and K. Scheinberg. Efficiently using second order information in large l1 regularization problems. arXiv:1303.6935, 2013.
[3] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. In NIPS 2012.
[4] M. Schmidt, E. Van Den Berg, M. P. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Int. Conf. Artif. Intell. Stat., 2009.
[5] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance estimation using quadratic approximation. In NIPS 2011.
[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, Cambridge, U.K., 2003.
[7] P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Technical report, Department of Computer Science, National Taiwan University, Taipei, Taiwan, 2013.
[8] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin.
A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research (JMLR), 11:3183–3234, 2010.
[9] A. Agarwal, S. Negahban, and M. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In NIPS 2010.
[10] K. Hou, Z. Zhou, A. M.-S. So, and Z.-Q. Luo. On the linear convergence of the proximal gradient method for trace norm regularization. In NIPS 2014.
[11] L. Xiao and T. Zhang. A proximal-gradient homotopy method for the l1-regularized least-squares problem. In ICML 2012.
[12] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Prog. B, 117, 2009.
[13] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for l1-regularized logistic regression. JMLR, 13:1999–2030, 2012.
[14] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[15] A. J. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 1952.
[16] N. Sokolovska, T. Lavergne, O. Cappe, and F. Yvon. Efficient learning of sparse conditional random fields for supervised sequence labelling. arXiv:0909.1308, 2009.
[17] Y. Tsuboi, Y. Unno, H. Kashima, and N. Okazaki. Fast Newton-CG method for batch learning of conditional random fields. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[18] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, New York, NY, USA, 2nd edition, 2006.
[19] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS 2003.
[20] R. Kassel. A Comparison of Approaches to On-line Handwritten Character Recognition. PhD thesis, MIT Spoken Language Systems Group, 1995.
[21] Y.
Tsuruoka, J. Tsujii, and S. Ananiadou. Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 477–485, Suntec, Singapore, 2009.
[22] J. Yu, S.V.N. Vishwanathan, S. Gunter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. JMLR, 11:1–57, 2010.
[23] G. Andrew and J. Gao. Scalable training of ℓ1-regularized log-linear models. In ICML 2007.
[24] J. E. Dennis and J. J. Moré. A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comp., 28(126):549–560, 1974.
[25] K. Scheinberg and X. Tang. Practical inexact proximal quasi-Newton method with global complexity analysis. COR@L Technical Report at Lehigh University. arXiv:1311.6547, 2013.
[26] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. arXiv:1206.1623, 2012.
[27] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, Dept. Statistics, Univ. California, Berkeley, 2003.