{"title": "Structure Regularization for Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 2402, "page_last": 2410, "abstract": "While there are many studies on weight regularization, the study on structure regularization is rare. Many existing systems on structured prediction focus on increasing the level of structural dependencies within the model. However, this trend could have been misdirected, because our study suggests that complex structures are actually harmful to generalization ability in structured prediction. To control structure-based overfitting, we propose a structure regularization framework via \\emph{structure decomposition}, which decomposes training samples into mini-samples with simpler structures, deriving a model with better generalization power. We show both theoretically and empirically that structure regularization can effectively control overfitting risk and lead to better accuracy. As a by-product, the proposed method can also substantially accelerate the training speed. The method and the theoretical results can apply to general graphical models with arbitrary structures. Experiments on well-known tasks demonstrate that our method can easily beat the benchmark systems on those highly-competitive tasks, achieving record-breaking accuracies yet with substantially faster training speed.", "full_text": "Structure Regularization for Structured Prediction\n\n\u2217MOE Key Laboratory of Computational Linguistics, Peking University\n\n\u2020School of Electronics Engineering and Computer Science, Peking University\n\nXu Sun\u2217\u2020\n\nxusun@pku.edu.cn\n\nAbstract\n\nWhile there are many studies on weight regularization, the study on structure reg-\nularization is rare. Many existing systems on structured prediction focus on in-\ncreasing the level of structural dependencies within the model. 
However, this trend could have been misdirected, because our study suggests that complex structures are actually harmful to generalization ability in structured prediction. To control structure-based overfitting, we propose a structure regularization framework via structure decomposition, which decomposes training samples into mini-samples with simpler structures, deriving a model with better generalization power. We show both theoretically and empirically that structure regularization can effectively control overfitting risk and lead to better accuracy. As a by-product, the proposed method can also substantially accelerate the training speed. The method and the theoretical results apply to general graphical models with arbitrary structures. Experiments on well-known tasks demonstrate that our method can easily beat the benchmark systems on those highly-competitive tasks, achieving state-of-the-art accuracies yet with substantially faster training speed.

1 Introduction

Structured prediction models are widely used to solve structure-dependent problems in a wide variety of application domains, including natural language processing, bioinformatics, speech recognition, and computer vision. Recently, many existing systems on structured prediction have focused on increasing the level of structural dependencies within the model. We argue that this trend could have been misdirected, because our study suggests that complex structures are actually harmful to model accuracy. While it is obvious that intensive structural dependencies can effectively incorporate structural information, it is less obvious that intensive structural dependencies have the drawback of increasing the generalization risk, because more complex structures are more prone to overfitting. 
Since this type of overfitting is caused by structure complexity, it can hardly be solved by ordinary regularization methods such as the L2 and L1 regularization schemes, which are designed only to control weight complexity.

To deal with this problem, we propose a simple structure regularization solution based on tag structure decomposition. The proposed method decomposes each training sample into multiple mini-samples with simpler structures, deriving a model with better generalization power. The proposed method is easy to implement, and it has several interesting properties: (1) We show both theoretically and empirically that it can reduce the overfitting risk. (2) It keeps the convexity of the objective function: a convex function with a structure regularizer is still convex. (3) It has no conflict with weight regularization: we can apply structure regularization together with weight regularization. (4) It accelerates the convergence rate in training. (5) It can be used for different types of models, including CRFs [6] and perceptrons [3].

The term structural regularization has been used in prior work for regularizing structures of features, including spectral regularization [1], regularizing feature structures for classifiers [23], and many recent studies on structured sparsity in structured prediction scenarios [13, 9], via adopting mixed-norm regularization [11], Group Lasso [25], and posterior regularization [5]. Compared with that prior work, we emphasize that our proposal on tag structure regularization is novel. 
This is because the term structure in all of the aforementioned work refers to structures of the feature space, which is substantially different from our proposal on regularizing tag structures (interactions among tags).

There are other related studies, including the studies of [20] and [12] on piecewise/decomposed training methods, and the study of [22] on a "lookahead" learning method. Our work differs from [20, 12, 22] mainly because our work is built on a regularization framework, with arguments and justifications for reducing generalization risk and improving accuracy. Also, our method and the theoretical results fit general graphical models with arbitrary structures, and the detailed algorithm is quite different. On generalization risk analysis, related studies include [2, 14] on non-structured classification and [21, 8, 7] on structured classification.

To the best of our knowledge, this is the first theoretical result quantifying the relation between structure complexity and generalization risk in structured prediction, and this is also the first proposal on structure regularization via regularizing tag-interactions. The contributions of this work1 are two-fold:

• On the methodology side, we propose a structure regularization framework for structured prediction. We show both theoretically and empirically that the proposed method can effectively reduce the overfitting risk, and at the same time accelerate the convergence rate in training. Our method and the theoretical analysis do not make assumptions based on specific structures. 
In other words, the method and the theoretical results apply to graphical models with arbitrary structures, including linear chains, trees, and general graphs.

• On the application side, for several important natural language processing tasks, our simple method can easily beat the benchmark systems on those highly-competitive tasks, achieving record-breaking accuracies as well as substantially faster training speed.

2 Structure Regularization

A graph of observations (even with arbitrary structures) can be indexed and denoted by an indexed sequence of observations OOO = {o1, . . . , on}. We use the term sample to denote OOO = {o1, . . . , on}. For example, in natural language processing, a sample may correspond to a sentence of n words with dependencies of tree structures (e.g., in syntactic parsing). For simplicity in analysis, we assume all samples have n observations (thus n tags). In a typical setting of structured prediction, all n tags are inter-dependent via Markov dependencies between neighboring tags. Thus, we call n the tag structure complexity, or simply structure complexity, below.

A sample is converted to an indexed sequence of feature vectors xxx = {xxx(1), . . . , xxx(n)}, where xxx(k) ∈ X has dimension d and corresponds to the local features extracted from position/index k. We can use an n × d matrix to represent xxx ∈ X^n. Let Z = (X^n, Y^n) and let zzz = (xxx, yyy) ∈ Z denote a sample in the training data. Suppose a training set is S = {zzz1 = (xxx1, yyy1), . . . , zzzm = (xxxm, yyym)}, with size m, where the samples are drawn i.i.d. from an unknown distribution D. A learning algorithm is a function G : Z^m → F with the function space F ⊂ {X^n → Y^n}, i.e., G maps a training set S to a function GS : X^n → Y^n. 
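To make the notation above concrete, here is a minimal sketch (our own toy illustration, not code from this work; the feature values and tags are hypothetical): a sample zzz = (xxx, yyy) pairs an n × d feature matrix with n gold-standard tags, and a training set S is a list of m such samples.

```python
def make_sample(feature_rows, tags):
    """A sample z = (x, y): x is an n-by-d list of local feature vectors
    (one row per position k), y holds the n gold-standard tags."""
    assert len(feature_rows) == len(tags), "one tag per position"
    return (feature_rows, tags)

def structure_complexity(sample):
    """n: the number of inter-dependent tags in the sample."""
    x, _tags = sample
    return len(x)

# A toy training set S of m = 2 samples, each position with d = 3 features.
z1 = make_sample([[1.0, 0.0, 0.5], [0.0, 1.0, 0.2]], ["B", "I"])
z2 = make_sample([[0.3, 0.3, 0.3], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]], ["B", "I", "I"])
S = [z1, z2]
```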
We suppose G is symmetric with respect to S, so that G is independent of the order of S.

Structural dependencies among tags are the major difference between structured prediction and non-structured classification. In the latter case, a local classification of g based on a position k can be expressed as g(xxx(k−a), . . . , xxx(k+a)), where the term {xxx(k−a), . . . , xxx(k+a)} represents a local window. However, for structured prediction, a local classification on a position depends on the whole input xxx = {xxx(1), . . . , xxx(n)} rather than a local window, due to the nature of structural dependencies among tags (e.g., graphical models like CRFs). Thus, in structured prediction a local classification on k should be denoted as g(xxx(1), . . . , xxx(n), k). To simplify the notation, we define

g(xxx, k) ≜ g(xxx(1), . . . , xxx(n), k)

1See the code at http://klcl.pku.edu.cn/member/sunxu/code.htm

Figure 1: An illustration of structure regularization in the simple linear-chain case, which decomposes a training sample zzz with structure complexity 6 into three mini-samples with structure complexity 2. Structure regularization can apply to more general graphs with arbitrary dependencies.

We define the point-wise cost function c : Y × Y → R+ as c[GS(xxx, k), yyy(k)], which measures the cost on a position k by comparing GS(xxx, k) with the gold-standard tag yyy(k), and we introduce the point-wise loss as

ℓ(GS, zzz, k) ≜ c[GS(xxx, k), yyy(k)]

Then, we define the sample-wise cost function C : Y^n × Y^n → R+, which is the cost function with respect to a whole sample, and we introduce the sample-wise loss as

L(GS, zzz) ≜ C[GS(xxx), yyy] = Σ_{k=1}^{n} ℓ(GS, zzz, k) = Σ_{k=1}^{n} c[GS(xxx, k), yyy(k)]

Given G and a training set S, what we are most interested in is the generalization risk in structured prediction (i.e., the expected average loss) [21, 8]:

R(GS) = E_zzz[ L(GS, zzz) / n ]

Since the distribution D is unknown, we have to estimate R(GS) by using the empirical risk:

Re(GS) = (1/mn) Σ_{i=1}^{m} L(GS, zzzi) = (1/mn) Σ_{i=1}^{m} Σ_{k=1}^{n} ℓ(GS, zzzi, k)

To state our theoretical results, we must describe several quantities and assumptions following prior work [2, 14]. We assume a simple real-valued structured prediction scheme such that the class predicted on position k of xxx is the sign of GS(xxx, k) ∈ D.2 Also, we assume the point-wise cost function cτ is convex and τ-smooth such that ∀y1, y2 ∈ D, ∀y* ∈ Y,

|cτ(y1, y*) − cτ(y2, y*)| ≤ τ|y1 − y2|    (1)

Also, we use a value ρ to quantify the bound of |GS(xxx, k) − GS\i(xxx, k)| while changing a single sample (with size n′ ≤ n) in the training set with respect to the structured input xxx. This ρ-admissible assumption can be formulated as ∀k,

|GS(xxx, k) − GS\i(xxx, k)| ≤ ρ||GS − GS\i||2 · ||xxx||2    (2)

where ρ ∈ R+ is a value related to the design of algorithm G.

2.1 Structure Regularization

Most existing regularization techniques are for regularizing model weights/parameters (e.g., a representative regularizer is the Gaussian regularizer, or so-called L2 regularizer), and we call such techniques weight regularization.

Definition 1 (Weight regularization) Let Nλ : F → R+ be a weight regularization function on F with regularization strength λ. The structured classification based objective function with general weight regularization is as follows:

Rλ(GS) ≜ Re(GS) + Nλ(GS)    (3)

2Many popular structured prediction models have a convex and real-valued cost function (e.g., CRFs).

Algorithm 1 Training with structure regularization
1: Input: model weights www, training set S, structure regularization strength α
2: repeat
3:     S′ ← ∅
4:     for i = 1 → m do
5:         Randomly decompose zzzi ∈ S into mini-samples Nα(zzzi) = {zzz(i,1), . . . , zzz(i,α)}
6:         S′ ← S′ ∪ Nα(zzzi)
7:     end for
8:     for i = 1 → |S′| do
9:         Sample zzz′ uniformly at random from S′, with gradient ∇g_zzz′(www)
10:        www ← www − η∇g_zzz′(www)
11:    end for
12: until convergence
13: return www

While weight regularization normalizes model weights, the proposed structure regularization method normalizes the structural complexity of the training samples. As illustrated in Figure 1, our proposal is based on tag structure decomposition, which can be formally defined as follows:

Definition 2 (Structure regularization) Let Nα : F → F be a structure regularization function on F with regularization strength α, where 1 ≤ α ≤ n. The structured classification based objective function with structure regularization is as follows:3

Rα(GS) ≜ Re[G_Nα(S)] = (1/mn) Σ_{i=1}^{m} Σ_{j=1}^{α} L[GS′, zzz(i,j)] = (1/mn) Σ_{i=1}^{m} Σ_{j=1}^{α} Σ_{k=1}^{n/α} ℓ[GS′, zzz(i,j), k]    (4)

where Nα(zzzi) randomly splits zzzi into α mini-samples {zzz(i,1), . . . , zzz(i,α)}, so that the mini-samples have a distribution on their sizes (structure complexities) with the expected value n′ = n/α. Thus, we get

S′ = {zzz(1,1), zzz(1,2), . . . , zzz(1,α), . . . , zzz(m,1), zzz(m,2), . . . , zzz(m,α)}    (5)

When the structure regularization strength α = 1, we have S′ = S and Rα = Re. 
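As a rough sketch of the decomposition-and-SGD scheme (our own illustration, not the authors' released code), each sample can be randomly split into α contiguous mini-samples, and SGD then iterates over the shuffled mini-samples. The gradient callback `grad(w, z)` and the contiguous linear-chain chunking are assumptions for illustration; for general graphs the decomposition would cut the dependency structure instead.

```python
import random

def decompose(sample, alpha, rng):
    """Randomly split a sample (x, y) of length n into alpha contiguous
    mini-samples (the linear-chain case of Figure 1)."""
    x, y = sample
    n = len(x)
    alpha = min(alpha, n)
    # alpha - 1 random cut points give alpha contiguous chunks.
    cuts = sorted(rng.sample(range(1, n), alpha - 1))
    bounds = [0] + cuts + [n]
    return [(x[a:b], y[a:b]) for a, b in zip(bounds, bounds[1:])]

def sgd_epoch(w, samples, alpha, grad, eta, rng):
    """One pass of the scheme: decompose every sample, then run SGD
    over the mini-samples in random order."""
    minis = []
    for z in samples:
        minis.extend(decompose(z, alpha, rng))   # S' <- S' U N_alpha(z_i)
    rng.shuffle(minis)
    for z in minis:
        g = grad(w, z)                           # stochastic gradient for one mini-sample
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w
```

With α = 1 no decomposition happens and this reduces to plain SGD on S, matching the α = 1 case of Definition 2.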
With mα mini-samples of expected structure complexity n/α, we can denote S′ more compactly as S′ = {zzz′1, zzz′2, . . . , zzz′mα} = {zzz(1,1), zzz(1,2), . . . , zzz(m,α)}, and Rα(GS) can be simplified as

Rα(GS) ≜ (1/mn) Σ_{i=1}^{mα} L(GS′, zzz′i) = (1/mn) Σ_{i=1}^{mα} Σ_{k=1}^{n/α} ℓ[GS′, zzz′i, k]    (6)

The structure regularization algorithm (in the stochastic gradient descent setting) is summarized in Algorithm 1. Recall that xxx = {xxx(1), . . . , xxx(n)} represents feature vectors. It should therefore be emphasized that the decomposition of xxx is a decomposition of the feature vectors, not of the original observations. Decomposing the feature vectors is more convenient and incurs no information loss, whereas decomposing the observations would require regenerating features and may lose some features.

Structure regularization has no conflict with weight regularization, and the two can be applied together.

Definition 3 (Structure & weight regularization) By combining structure regularization in Definition 2 and weight regularization in Definition 1, the structured classification based objective function is as follows:

Rα,λ(GS) ≜ Rα(GS) + Nλ(GS)    (7)

When α = 1, we have Rα,λ = Re(GS) + Nλ(GS) = Rλ.

Like existing weight regularization methods, our structure regularization is currently used only in the training stage; we do not use structure regularization in the test stage.

3The notation N is overloaded here. For clarity throughout, N with subscript λ refers to the weight regularization function, and N with subscript α refers to the structure regularization function.

2.2 Reduction of Generalization Risk

In contrast to the simplicity of the algorithm, the theoretical analysis is quite technical. In this paper we only describe the major theoretical result. Detailed analysis and proofs are given in the full version of this work [16].

Theorem 4 (Generalization vs. structure regularization) Let the structured prediction objective function of G be penalized by structure regularization with factor α ∈ [1, n] and L2 weight regularization with factor λ, and let the penalized function have a minimizer f:

f = argmin_{g∈F} Rα,λ(g) = argmin_{g∈F} ( (1/mn) Σ_{j=1}^{mα} Lτ(g, zzz′j) + (λ/2)||g||² )    (8)

Assume the point-wise loss ℓτ is convex and differentiable, and is bounded by ℓτ(f, zzz, k) ≤ γ. Assume f(xxx, k) is ρ-admissible. Let a local feature value be bounded by v such that xxx(k,q) ≤ v for q ∈ {1, . . . , d}. Then, for any δ ∈ (0, 1), with probability at least 1 − δ over the random draw of the training set S, the generalization risk R(f) is bounded by

R(f) ≤ Re(f) + 2dτ²ρ²v²n²/(mλα) + ( (4m − 2)dτ²ρ²v²n²/(mλα) + γ ) √(ln δ⁻¹ / (2m))    (9)

The proof is given in the full version of this work [16]. We call the term 2dτ²ρ²v²n²/(mλα) + ( (4m − 2)dτ²ρ²v²n²/(mλα) + γ ) √(ln δ⁻¹ / (2m)) in (9) the "overfit-bound", and reducing the overfit-bound is crucial for reducing the generalization risk. Most importantly, we can see from the overfit-bound that the structure regularization factor α always appears together with the weight regularization factor λ, and the two work together on reducing the overfit-bound. This indicates that structure regularization is as important as weight regularization for reducing the generalization risk.

Theorem 4 also indicates that overly simple structures may shrink the overfit-bound but leave a dominating empirical risk, while overly complex structures may shrink the empirical risk but leave a dominating overfit-bound. Thus, to achieve the best prediction accuracy, a balanced structure complexity should be used for training the model.

2.3 Accelerating Convergence Rates in Training

We also analyze the impact of structure regularization on the convergence rate of online learning. Following prior work [10], our analysis is based on stochastic gradient descent (SGD) with a fixed learning rate. Let g(www) be the structured prediction objective function and www ∈ W the weight vector. Recall that the SGD update with fixed learning rate η has the form

wwwt+1 ← wwwt − η∇gzzzt(wwwt)    (10)

where gzzz(www) is the stochastic estimate of the objective function based on zzz, which is randomly drawn from S. To state our convergence rate results, we need several assumptions following (Nemirovski et al. 2009). We assume g is strongly convex with modulus c, that is, ∀www, www′ ∈ W,

g(www′) ≥ g(www) + (www′ − www)ᵀ∇g(www) + (c/2)||www′ − www||²    (11)

When g is strongly convex, there is a global optimum/minimizer www*. 
We also assume Lipschitz continuous differentiability of g with constant q, that is, ∀www, www′ ∈ W,

||∇g(www′) − ∇g(www)|| ≤ q||www′ − www||    (12)

It is also reasonable to assume that the norm of ∇gzzz(www) has almost surely positive correlation with the structure complexity of zzz,4 which can be quantified by a bound κ ∈ R+:

||∇gzzz(www)||2 ≤ κ|zzz| almost surely for ∀www ∈ W    (13)

where |zzz| denotes the structure complexity of zzz. Moreover, it is reasonable to assume

ηc < 1    (14)

because even ordinary gradient descent methods will diverge if ηc > 1. Then, we show that structure regularization can quadratically accelerate the SGD rates of convergence:

4Many models (e.g., CRFs) satisfy this assumption in that the gradient of a larger sample is expected to have a larger norm.

Proposition 5 (Convergence rates vs. structure regularization) With the aforementioned assumptions, let the SGD training have a learning rate defined as η = cϵβα²/(qκ²n²), where ϵ > 0 is a convergence tolerance value and β ∈ (0, 1]. Let t be an integer satisfying

t ≥ ( qκ²n² / (ϵβc²α²) ) log(qa0/ϵ)    (15)

where n and α ∈ [1, n] are as before, and a0 is the initial distance, which depends on the initialization of the weights www0 and the minimizer www*, i.e., a0 = ||www0 − www*||². Then, after t updates of www, it converges such that E[g(wwwt) − g(www*)] ≤ ϵ.

The proof is given in the full version of this work [16]. As we can see, using structure regularization with strength α can quadratically accelerate the convergence rate, by a factor of α².

3 Experiments

Diversified Tasks. 
The natural language processing tasks include (1) part-of-speech tagging, (2)\nbiomedical named entity recognition, and (3) Chinese word segmentation. The signal processing\ntask is (4) sensor-based human activity recognition. The tasks (1) to (3) use boolean features and\nthe task (4) adopts real-valued features. From tasks (1) to (4), the averaged structure complexity\n(number of observations) n is very different, with n = 23.9, 26.5, 46.6, 67.9, respectively. The\ndimension of tags |Y| is also diversi\ufb01ed among tasks, with |Y| ranging from 5 to 45.\nPart-of-Speech Tagging (POS-Tagging). Part-of-Speech (POS) tagging is an important and highly\ncompetitive task. We use the standard benchmark dataset in prior work [3], with 38,219 training\nsamples and 5,462 test samples. Following prior work [22], we use features based on words and\nlexical patterns, with 393,741 raw features5. The evaluation metric is per-word accuracy.\nBiomedical Named Entity Recognition (Bio-NER). This task is from the BioNLP-2004 shared\ntask [22]. There are 17,484 training samples and 3,856 test samples. Following prior work [22],\nwe use word pattern features and POS features, with 403,192 raw features in total. The evaluation\nmetric is balanced F-score.\nWord Segmentation (Word-Seg). We use the MSR data provided by SIGHAN-2004 contest [4].\nThere are 86,918 training samples and 3,985 test samples. The features are similar to [18], with\n1,985,720 raw features in total. The evaluation metric is balanced F-score.\nSensor-based Human Activity Recognition (Act-Recog). This is a task based on real-valued sen-\nsor signals, with the data extracted from the Bao04 activity recognition dataset [17]. The features\nare similar to [17], with 1,228 raw features in total. There are 16,000 training samples and 4,000\ntest samples. 
The evaluation metric is accuracy.\nWe choose the CRFs [6] and structured perceptrons (Perc) [3], which are arguably the most popular\nprobabilistic and non-probabilistic structured prediction models, respectively. The CRFs are trained\nusing the SGD algorithm,6 and the baseline method is the traditional weight regularization scheme\n(WeightReg), which adopts the most representative L2 weight regularization, i.e., a Gaussian prior.\nWe also tested sparsity emphasized regularization methods, including L1 regularization and Group\nLasso regularization [9]. However, although the feature sparsity is improved, we \ufb01nd in experiments\nthat in most cases those sparsity emphasized regularization methods have lower accuracy than the\nL2 regularization. For the structured perceptrons, the baseline WeightAvg is the popular implicit\nregularization technique based on parameter averaging, i.e., averaged perceptron [3].\nThe rich edge features [19, 18] are employed for all methods. All methods are based on the 1st-\norder Markov dependency. For WeightReg, the L2 regularization strengths (i.e., \u03bb/2 in Eq.(8)) are\ntuned among values 0.1, 0.5, 1, 2, 5, and are determined on the development data (POS-Tagging) or\nsimply via 4-fold cross validation on the training set (Bio-NER, Word-Seg, and Act-Recog). With\nthis automatic tuning for WeightReg, we set 2, 5, 1 and 5 for POS-Tagging, Bio-NER, Word-Seg,\nand Act-Recog tasks, respectively.\n\n5Raw features are those observation features based only on xxx, i.e., no combination with tag information.\n6In theoretical analysis, following prior work we adopt the SGD with \ufb01xed learning rate, as described in\nSection 2.3. However, since the SGD with decaying learning rate is more commonly used in practice, in\nexperiments we use the SGD with decaying learning rate.\n\n6\n\n\fFigure 2: On the four tasks, comparing the structure regularization method (StructReg) with existing\nregularization methods in terms of accuracy/F-score. 
Row-1 shows the results on CRFs and Row-2 shows the results on structured perceptrons.

Table 1: Comparing our results with the benchmark systems on corresponding tasks.

                    POS-Tagging (Acc%)   Bio-NER (F1%)      Word-Seg (F1%)
Benchmark system    97.33 (see [15])     72.28 (see [22])   97.19 (see [4])
Our results         97.36                72.43              97.50

3.1 Experimental Results

The experimental results in terms of accuracy/F-score are shown in Figure 2. For the CRF model, the training is convergent, and the results at the convergence state (decided by relative objective change with a threshold value of 0.0001) are shown. For the structured perceptron model, the training is typically not convergent, and the results at the 10th iteration are shown. For stability of the curves, the results of the structured perceptrons are averaged over 10 repeated runs.

Since different samples have different sizes n in practice, we set α as a function of n, so that the generated mini-samples have a fixed size n′ with n′ = n/α. Actually, n′ follows a probability distribution over sizes because we adopt randomized decomposition. For example, n′ = 5.5 means the mini-samples are a mixture of the ones with size 5 and the ones with size 6, with the mean of the size distribution being 5.5. In the figure, the curves are based on n′ = 1.5, 2.5, 3.5, 5.5, 10.5, 15.5, 20.5.

As we can see, the results are quite consistent: structure regularization leads to higher accuracies/F-scores than the existing baselines. We also conduct significance tests based on the t-test. Since the t-test for F-score based tasks (Bio-NER and Word-Seg) may be unreliable7, we only perform the t-test for the accuracy-based tasks, i.e., POS-Tagging and Act-Recog. For POS-Tagging, the significance test suggests that the superiority of StructReg over WeightReg is very statistically significant, with p < 0.01. 
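The fractional expected mini-sample size n′ described above (e.g., n′ = 5.5 as a mixture of sizes 5 and 6) can be realized, for instance, by randomized chunking. This sketch is our own interpretation of the randomized decomposition, not the paper's exact implementation:

```python
import random

def chunk_sizes(n, n_prime, rng):
    """Split a length-n sample into chunk sizes that mix floor(n_prime)
    and floor(n_prime) + 1, so the mean size is about n_prime
    (e.g., n_prime = 5.5 mixes sizes 5 and 6)."""
    sizes = []
    remaining = n
    lo = int(n_prime)
    frac = n_prime - lo
    while remaining > 0:
        size = lo + (1 if rng.random() < frac else 0)
        size = max(1, min(size, remaining))  # clamp at the sample boundary
        sizes.append(size)
        remaining -= size
    return sizes
```

For a sample with n = 46 and n′ = 5.5, this yields chunks of sizes 5 and 6 (with a possibly smaller final chunk) summing to 46.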
For Act-Recog, the significance tests suggest that both the StructReg vs. WeightReg difference and the StructReg vs. WeightAvg difference are extremely statistically significant, with p < 0.0001 in both cases. The experimental results support our theoretical analysis that structure regularization can further reduce the generalization risk over existing weight regularization techniques.

Our method outperforms the benchmark systems on the three important natural language processing tasks. The POS-Tagging task is a highly competitive task, with many methods proposed, and the best report (without using extra resources) until now is achieved by using a bidirectional learning model

7Indeed we can convert F-scores to accuracy scores for t-test, but in many cases this conversion is unreliable. For example, very different F-scores may correspond to similar accuracy scores.

Figure 3: On the four tasks, comparing the structure regularization method (StructReg) with existing regularization methods in terms of wall-clock training time.

in [15],8 with the 
accuracy 97.33%. Our simple method achieves better accuracy than all of those state-of-the-art systems. Furthermore, our method achieves as good scores as the benchmark systems on the Bio-NER and Word-Seg tasks. On the Bio-NER task, [22] achieves 72.28% based on lookahead learning and [24] achieves 72.65% based on reranking. On the Word-Seg task, [4] achieves 97.19% based on maximum entropy classification and our recent work [18] achieves 97.5% based on feature-frequency-adaptive online learning. The comparisons are summarized in Table 1.

Figure 3 shows experimental comparisons in terms of wall-clock training time. As we can see, the proposed method can substantially improve the training speed. The speedup comes not only from the faster convergence rates, but also from the faster processing time on the structures, because it is more efficient to process the decomposed samples with simple structures.

4 Conclusions and Future Work

We proposed a structure regularization framework, which decomposes training samples into mini-samples with simpler structures, deriving a trained model with regularized structural complexity. Our theoretical analysis showed that this method can effectively reduce the generalization risk, and can also accelerate the convergence speed in training. The proposed method does not change the convexity of the objective function, and can be used together with any existing weight regularization methods. The proposed method and the theoretical results fit general structures including linear chains, trees, and graphs. Experimental results demonstrated that our method achieved better results than state-of-the-art systems on several highly-competitive tasks, while training substantially faster.

The structure decomposition of structure regularization can naturally be used for parallel training, achieving parallel training among mini-samples. 
As future work, we will combine structure regularization with parallel training.

Acknowledgments

This work was supported in part by National Natural Science Foundation of China (No. 61300063), and Doctoral Fund of Ministry of Education of China (No. 20130001120004).

8See a collection of the systems at http://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)

References

[1] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In Proceedings of NIPS'07. MIT Press, 2007.

[2] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

[3] M. Collins. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP'02, pages 1–8, 2002.

[4] J. Gao, G. Andrew, M. Johnson, and K. Toutanova. 
A comparative study of parameter estimation methods for statistical natural language processing. In Proceedings of ACL'07, pages 824-831, 2007.

[5] J. Graça, K. Ganchev, B. Taskar, and F. Pereira. Posterior vs parameter sparsity in latent variable models. In Proceedings of NIPS'09, pages 664-672, 2009.

[6] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML'01, pages 282-289, 2001.

[7] B. London, B. Huang, B. Taskar, and L. Getoor. Collective stability in structured prediction: Generalization from one example. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 828-836, 2013.

[8] B. London, B. Huang, B. Taskar, and L. Getoor. PAC-Bayes generalization bounds for randomized structured prediction. In NIPS Workshop on Perturbation, Optimization and Statistics, 2013.

[9] A. F. T. Martins, N. A. Smith, M. A. T. Figueiredo, and P. M. Q. Aguiar. Structured sparsity in structured prediction. In Proceedings of EMNLP'11, pages 1500-1511, 2011.

[10] F. Niu, B. Recht, C. Re, and S. J. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS'11, pages 693-701, 2011.

[11] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for l1,infinity regularization. In Proceedings of ICML'09, page 108, 2009.

[12] R. Samdani and D. Roth. Efficient decomposed learning for structured prediction. In ICML'12, 2012.

[13] M. W. Schmidt and K. P. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials. In Proceedings of AISTATS'10, volume 9 of JMLR Proceedings, pages 709-716, 2010.

[14] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability and stability in the general learning setting.
In Proceedings of COLT'09, 2009.

[15] L. Shen, G. Satta, and A. K. Joshi. Guided learning for bidirectional sequence classification. In Proceedings of ACL'07, 2007.

[16] X. Sun. Structure regularization for structured prediction: Theories and experiments. Technical report, arXiv, 2014.

[17] X. Sun, H. Kashima, and N. Ueda. Large-scale personalized human activity recognition using online multitask learning. IEEE Trans. Knowl. Data Eng., 25(11):2551-2563, 2013.

[18] X. Sun, W. Li, H. Wang, and Q. Lu. Feature-frequency-adaptive on-line training for fast and accurate natural language processing. Computational Linguistics, 40(3):563-586, 2014.

[19] X. Sun, H. Wang, and W. Li. Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection. In Proceedings of ACL'12, pages 253-262, 2012.

[20] C. A. Sutton and A. McCallum. Piecewise pseudolikelihood for efficient training of conditional random fields. In ICML'07, pages 863-870. ACM, 2007.

[21] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS'03, 2003.

[22] Y. Tsuruoka, Y. Miyao, and J. Kazama. Learning with lookahead: Can history-based models rival globally optimized models? In Conference on Computational Natural Language Learning, 2011.

[23] H. Xue, S. Chen, and Q. Yang. Structural regularized support vector machine: A framework for structural large margin classifier. IEEE Transactions on Neural Networks, 22(4):573-587, 2011.

[24] K. Yoshida and J. Tsujii. Reranking for biomedical named-entity recognition. In ACL Workshop on BioNLP, pages 209-216, 2007.

[25] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society, Series B, 68:49-67, 2006.