{"title": "Efficient multiple hyperparameter learning for log-linear models", "book": "Advances in Neural Information Processing Systems", "page_first": 377, "page_last": 384, "abstract": "Using multiple regularization hyperparameters is an effective method for managing model complexity in problems where input features have varying amounts of noise. While algorithms for choosing multiple hyperparameters are often used in neural networks and support vector machines, they are not common in structured prediction tasks, such as sequence labeling or parsing. In this paper, we consider the problem of learning regularization hyperparameters for log-linear models, a class of probabilistic models for structured prediction tasks which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning provides a significant boost in accuracy compared to models learned using only a single regularization hyperparameter.", "full_text": "Ef\ufb01cient multiple hyperparameter\n\nlearning for log-linear models\n\nChuong B. Do\n\nChuan-Sheng Foo\n\nComputer Science Department\n\nStanford University\nStanford, CA 94305\n\nAndrew Y. Ng\n\n{chuongdo,csfoo,ang}@cs.stanford.edu\n\nAbstract\n\nIn problems where input features have varying amounts of noise, using distinct\nregularization hyperparameters for different features provides an effective means\nof managing model complexity. While regularizers for neural networks and sup-\nport vector machines often rely on multiple hyperparameters, regularizers for\nstructured prediction models (used in tasks such as sequence labeling or pars-\ning) typically rely only on a single shared hyperparameter for all features. 
In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of probabilistic models for structured prediction that includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter.

1 Introduction

In many supervised learning methods, overfitting is controlled through the use of regularization penalties that limit model complexity. The effectiveness of penalty-based regularization for a given learning task depends not only on the type of regularization penalty used (e.g., L1 vs. L2) [29] but also (and perhaps even more importantly) on the choice of hyperparameters governing the regularization penalty (e.g., the hyperparameter λ in an isotropic Gaussian parameter prior, λ||w||²).

When only a single hyperparameter must be tuned, cross-validation provides a simple yet reliable procedure for hyperparameter selection. For example, the regularization hyperparameter C in a support vector machine (SVM) is usually tuned by training the SVM with several different values of C and selecting the one that achieves the best performance on a holdout set. In many situations, using multiple hyperparameters gives the distinct advantage of allowing models with features of varying strength; for instance, in a natural language processing (NLP) task, features based on word bigrams are typically noisier than those based on individual word occurrences, and hence should be "more regularized" to prevent overfitting.
Unfortunately, for sophisticated models with multiple hyperparameters [23], the naïve grid-search strategy of directly trying out possible combinations of hyperparameter settings quickly grows infeasible as the number of hyperparameters becomes large.

Scalable strategies for cross-validation-based hyperparameter learning that rely on computing the gradient of cross-validation loss with respect to the desired hyperparameters arose first in the neural network modeling community [20, 21, 1, 12]. More recently, similar cross-validation optimization techniques have been proposed for other supervised learning models [3], including support vector machines [4, 10, 16], Gaussian processes [35, 33], and related kernel learning methods [18, 17, 39]. Here, we consider the problem of hyperparameter learning for a specialized class of structured classification models known as conditional log-linear models (CLLMs), a generalization of conditional random fields (CRFs) [19].

Whereas standard binary classification involves mapping an object x ∈ X to some binary output y ∈ Y (where Y = {±1}), the input space X and output space Y in a structured classification task generally contain complex combinatorial objects (such as sequences, trees, or matchings). Designing hyperparameter learning algorithms for structured classification models thus yields a number of unique computational challenges not normally encountered in the flat classification setting. In this paper, we derive a gradient-based approach for optimizing the hyperparameters of a CLLM using the loss incurred on a holdout set. We describe the required algorithms, specific to CLLMs, that make the needed computations tractable.
Finally, we demonstrate on both simulations and a real-world computational biology task that our hyperparameter learning method can give gains over learning flat unstructured regularization priors.

2 Preliminaries

Conditional log-linear models (CLLMs) are a probabilistic framework for sequence labeling or parsing problems, where X is an exponentially large space of possible input sequences and Y is an exponentially large space of candidate label sequences or parse trees. Let F : X × Y → Rⁿ be a fixed vector-valued mapping from input-output pairs to an n-dimensional feature space. CLLMs model the conditional probability of y given x as P(y | x; w) = exp(wᵀF(x, y))/Z(x), where Z(x) = Σ_{y′∈Y} exp(wᵀF(x, y′)). Given a training set T = {(x⁽ⁱ⁾, y⁽ⁱ⁾)}ᵢ₌₁ᵐ of i.i.d. labeled input-output pairs drawn from some unknown fixed distribution D over X × Y, the parameter learning problem is typically posed as maximum a posteriori (MAP) estimation (or, equivalently, regularized logloss minimization):

    w⋆ = arg min_{w ∈ Rⁿ} ( (1/2) wᵀCw − Σᵢ₌₁ᵐ log P(y⁽ⁱ⁾ | x⁽ⁱ⁾; w) ),    (OPT1)

where (1/2) wᵀCw (for some positive definite matrix C) is a regularization penalty used to prevent overfitting. Here, C is the inverse covariance matrix of a Gaussian prior on the parameters w.

While a number of efficient procedures exist for solving the optimization problem OPT1 [34, 11], little attention is usually given to choosing an appropriate regularization matrix C. Generally, C is parameterized using a small number of free variables, d ∈ Rᵏ, known as the hyperparameters of the model.
Given a holdout set H = {(x̃⁽ⁱ⁾, ỹ⁽ⁱ⁾)}ᵢ₌₁^m̃ of i.i.d. examples drawn from D, hyperparameter learning itself can be cast as an optimization problem:

    minimize_{d ∈ Rᵏ}  −Σᵢ₌₁^m̃ log P(ỹ⁽ⁱ⁾ | x̃⁽ⁱ⁾; w⋆(C)).    (OPT2)

In words, OPT2 finds the hyperparameters d whose regularization matrix C leads the parameter vector w⋆(C) learned from the training set to obtain small logloss on holdout data. For many real-world applications, C is assumed to take a simple form, such as a scaled identity matrix, CI. While this parameterization may be partially motivated by concerns of hyperparameter overfitting [28], such a choice usually stems from the difficulty of hyperparameter inference.

In practice, grid-search procedures provide a reliable method for determining hyperparameters to low precision: one trains the model using several candidate values of C (e.g., C ∈ {…, 2⁻², 2⁻¹, 2⁰, 2¹, 2², …}), and chooses the C that minimizes holdout logloss. While this strategy is suitable for tuning a single model hyperparameter, more sophisticated strategies are necessary when optimizing multiple hyperparameters.

3 Learning multiple hyperparameters

In this section, we lay the framework for multiple hyperparameter learning by describing a simple yet flexible parameterization of C that arises quite naturally in many practical problems. We then describe a generic strategy for hyperparameter adaptation via gradient-based optimization.

Consider a setting in which predefined subsets of parameter components (which we call regularization groups) are constrained to use the same hyperparameters [6]. For instance, in an NLP task, individual word occurrence features may be placed in a separate regularization group from word bigram features. Formally, let k be a fixed number of regularization groups, and let π : {1, …, n} → {1, …
, k} be a prespecified mapping from parameters to regularization groups. Furthermore, for a vector x ∈ Rᵏ, define its expansion x̄ ∈ Rⁿ as x̄ = (x_{π(1)}, x_{π(2)}, …, x_{π(n)}).

In the sequel, we parameterize C ∈ Rⁿˣⁿ in terms of some hyperparameter vector d ∈ Rᵏ as the diagonal matrix C(d) = diag(exp(d̄)). Under this representation, C(d) is necessarily positive definite, so OPT2 can be written as an unconstrained minimization over the variables d ∈ Rᵏ. Specifically, let ℓ_T(w) = −Σᵢ₌₁ᵐ log P(y⁽ⁱ⁾ | x⁽ⁱ⁾; w) denote the training logloss and ℓ_H(w) = −Σᵢ₌₁^m̃ log P(ỹ⁽ⁱ⁾ | x̃⁽ⁱ⁾; w) the holdout logloss for a parameter vector w. Omitting the dependence of C on d for notational convenience, we have the optimization problem

    minimize_{d ∈ Rᵏ}  ℓ_H(w⋆)   subject to   w⋆ = arg min_{w ∈ Rⁿ} ( (1/2) wᵀCw + ℓ_T(w) ).    (OPT2′)

For any fixed setting of the hyperparameters, the objective function of OPT2′ can be evaluated by (1) using the hyperparameters d to determine the regularization matrix C, (2) solving OPT1 using C to determine w⋆, and (3) computing the holdout logloss using the parameters w⋆. In the next section, we derive a method for computing the gradient of the objective function of OPT2′ with respect to the hyperparameters. Given both procedures for function and gradient evaluation, we may apply standard gradient-based optimization (e.g., conjugate gradient or L-BFGS [30]) in order to find a local optimum of the objective. In general, we observe that only a few iterations (∼5) are usually sufficient to determine reasonable hyperparameters to low accuracy.

4 The hyperparameter gradient

Note that the optimization objective ℓ_H(w⋆) is a function of w⋆.
In turn, w⋆ is a function of the hyperparameters d, as implicitly defined by the gradient stationarity condition, Cw⋆ + ∇_w ℓ_T(w⋆) = 0. To compute the hyperparameter gradient, we will use both of these facts.

4.1 Deriving the hyperparameter gradient

First, we apply the chain rule to the objective function of OPT2′ to obtain

    ∇_d ℓ_H(w⋆) = J_dᵀ ∇_w ℓ_H(w⋆),    (1)

where J_d is the n × k Jacobian matrix whose (i, j)th entry is ∂w⋆ᵢ/∂dⱼ. The term ∇_w ℓ_H(w⋆) is simply the gradient of the holdout logloss evaluated at w⋆. For decomposable models, this may be computed exactly via dynamic programming (e.g., the forward/backward algorithm for chain-structured models, or the inside/outside algorithm for grammar-based models).

Next, we show how to compute the Jacobian matrix J_d. Recall that at the optimum of the smooth unconstrained optimization problem OPT1, the partial derivative of the objective with respect to any parameter must vanish. In particular, the partial derivative of (1/2) wᵀCw + ℓ_T(w) with respect to wᵢ vanishes when w = w⋆, so

    0 = Cᵢᵀ w⋆ + (∂/∂wᵢ) ℓ_T(w⋆),    (2)

where Cᵢᵀ denotes the ith row of the C matrix. Since (2) uniquely defines w⋆ (as OPT1 is a strictly convex optimization problem), we can use implicit differentiation to obtain the needed partial derivatives.
Specifically, we can differentiate both sides of (2) with respect to dⱼ to obtain

    0 = Σₚ₌₁ⁿ ( w⋆ₚ (∂Cᵢₚ/∂dⱼ) + Cᵢₚ (∂w⋆ₚ/∂dⱼ) ) + Σₚ₌₁ⁿ (∂²ℓ_T(w⋆)/∂wᵢ∂wₚ) (∂w⋆ₚ/∂dⱼ)    (3)

      = I{π(i)=j} w⋆ᵢ exp(dⱼ) + Σₚ₌₁ⁿ ( Cᵢₚ + ∂²ℓ_T(w⋆)/∂wᵢ∂wₚ ) (∂w⋆ₚ/∂dⱼ).    (4)

Stacking (4) for all i ∈ {1, …, n} and j ∈ {1, …, k}, we obtain the equivalent matrix equation

    0 = B + (C + ∇²_w ℓ_T(w⋆)) J_d,    (5)

where B is the n × k matrix whose (i, j)th element is I{π(i)=j} w⋆ᵢ exp(dⱼ), and ∇²_w ℓ_T(w⋆) is the Hessian of the training logloss evaluated at w⋆. Finally, solving these equations for J_d, we obtain

    J_d = −(C + ∇²_w ℓ_T(w⋆))⁻¹ B.    (6)

4.2 Computing the hyperparameter gradient efficiently

In principle, one could simply use (6) to obtain the Jacobian matrix J_d directly. However, computing the n × n matrix (C + ∇²_w ℓ_T(w⋆))⁻¹ is difficult. Computing the Hessian matrix ∇²_w ℓ_T(w⋆) in a typical CLLM requires approximately n times the cost of a single logloss gradient evaluation. Once the Hessian has been computed, typical matrix inversion routines take O(n³) time. Even more problematic, the Ω(n²) memory usage for storing the Hessian is prohibitive, as typical log-linear models (e.g., in NLP) may have thousands or even millions of features. To deal with these

Algorithm 1: Gradient computation for hyperparameter selection.

Input: training set T = {(x⁽ⁱ⁾, y⁽ⁱ⁾)}ᵢ₌₁ᵐ, holdout set H = {(x̃⁽ⁱ⁾, ỹ⁽ⁱ⁾)}ᵢ₌₁^m̃, current hyperparameters d ∈ Rᵏ
Output: hyperparameter gradient ∇_d ℓ_H(w⋆)

1. 
Compute solution w⋆ to OPT1 using regularization matrix C = diag(exp(d̄)).
2. Form the matrix B ∈ Rⁿˣᵏ such that (B)ᵢⱼ = I{π(i)=j} w⋆ᵢ exp(dⱼ).
3. Use the conjugate gradient algorithm to solve the linear system (C + ∇²_w ℓ_T(w⋆)) x = ∇_w ℓ_H(w⋆).
4. Return −Bᵀx.

Figure 1: Pseudocode for gradient computation.

problems, we first explain why the product (C + ∇²_w ℓ_T(w⋆))v for any arbitrary vector v ∈ Rⁿ can be computed in O(n) time, even though forming (C + ∇²_w ℓ_T(w⋆))⁻¹ is expensive. Using this result, we then describe an efficient procedure for computing the holdout hyperparameter gradient which avoids the expensive Hessian computation and inversion steps of the direct method.

First, since C is diagonal, the product of C with any arbitrary vector v is trivially computable in O(n) time. Second, although direct computation of the Hessian is inefficient in a generic log-linear model, computing the product of the Hessian with v can be done quickly, using any of the following techniques, listed in order of increasing implementation effort (and numerical precision):

1. Finite differencing. Use the numerical approximation

       ∇²_w ℓ_T(w⋆) · v = lim_{r→0} ( ∇_w ℓ_T(w⋆ + rv) − ∇_w ℓ_T(w⋆) ) / r.    (7)

2. Complex-step derivative [24]. Use the following identity from complex analysis:

       ∇²_w ℓ_T(w⋆) · v = lim_{r→0} Im{ ∇_w ℓ_T(w⋆ + i·rv) } / r,    (8)

   where Im{·} denotes the imaginary part of its complex argument (in this case, a vector). Because there is no subtraction in the numerator of the right-hand expression, the complex-step derivative does not suffer from the numerical problems of the finite-differencing method that result from cancellation.
As a consequence, much smaller step sizes can be used, allowing for greater accuracy.

3. Analytical computation. Given an existing O(n) algorithm for computing gradients analytically, define the differential operator

       R_v{f(w)} = lim_{r→0} ( f(w + rv) − f(w) ) / r = (∂/∂r) f(w + rv) |_{r=0},    (9)

   for which one can verify that R_v{∇_w ℓ_T(w⋆)} = ∇²_w ℓ_T(w⋆) · v. By applying standard rules for differential operators, R_v{∇_w ℓ_T(w⋆)} can be computed recursively using a modified version of the original gradient computation routine; see [31] for details.

Hessian-vector products for graphical models were previously used in the context of step-size adaptation for stochastic gradient descent [36]. In our experiments, we found that the simplest method, finite differencing, provided sufficient accuracy for our application.

Given the above procedure for computing matrix-vector products, we can now use the conjugate gradient (CG) method to solve the matrix equation (5) to obtain J_d. Unlike direct methods for solving linear systems Ax = b, CG is an iterative method that relies on the matrix A only through matrix-vector products Av. In practice, few steps of the CG algorithm are generally needed to find an approximate solution of a linear system with acceptable accuracy. Using CG in this way amounts to solving k linear systems, one for each column of the J_d matrix.
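As a concrete illustration, the steps of Figure 1 can be sketched in a few lines of NumPy/SciPy. This is a minimal sketch rather than the authors' implementation: the callables `grad_train_logloss` and `grad_holdout_logloss` stand in for the O(n) analytic gradient routines of a real CLLM, and the Hessian-vector product uses the finite-differencing approximation (7).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hyperparam_gradient(d, pi, w_star, grad_train_logloss, grad_holdout_logloss, r=1e-6):
    """Sketch of Figure 1: holdout hyperparameter gradient via implicit differentiation.

    d:  hyperparameters in R^k; pi: length-n array mapping parameter i to its group.
    grad_train_logloss / grad_holdout_logloss: callables w -> gradient in R^n.
    """
    n, k = w_star.size, d.size
    c = np.exp(d[pi])                                   # diagonal of C(d) = diag(exp(d_bar))
    g0 = grad_train_logloss(w_star)

    def matvec(v):
        # (C + Hessian) v, with the Hessian-vector product by finite differencing (7)
        Hv = (grad_train_logloss(w_star + r * v) - g0) / r
        return c * v + Hv

    A = LinearOperator((n, n), matvec=matvec, dtype=float)
    x, info = cg(A, grad_holdout_logloss(w_star))       # step 3: (C + H) x = grad l_H
    assert info == 0, "CG did not converge"

    # Steps 2 and 4: B[i, j] = I{pi(i) = j} * w*_i * exp(d_j); return -B^T x.
    return -np.bincount(pi, weights=w_star * c * x, minlength=k)
```

Because CG touches the matrix only through `matvec`, the Hessian is never formed explicitly, in line with the O(n)-per-product argument above.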
Unlike the direct method of forming the (C + ∇²_w ℓ_T(w⋆)) matrix and its inverse, solving the linear systems avoids the expensive Ω(n²) cost of Hessian computation and matrix inversion.

Nevertheless, even this approach for computing the Jacobian matrices still requires the solution of multiple linear systems, which scales poorly when the number of hyperparameters k is large.

[Figure 2: (a) state diagram of the HMM, with hidden labels y₁, …, y_L, "observed features" xʲ for j ∈ {1, …, R}, and "noise features" xʲ for j ∈ {R+1, …, 40}; (b) and (c) plot the proportion of incorrect labels for the grid, single, separate, and grouped schemes.]

Figure 2: HMM simulation experiments. (a) State diagram of the HMM used in the simulations. (b) Testing set performance when varying R, using M = 10. (c) Testing set performance when varying M, using R = 5. In both (b) and (c), each point represents an average over 100 independent runs of HMM training/holdout/testing set generation and CRF training and hyperparameter optimization.

However, we can do much better by reorganizing the computations in such a way that the Jacobian matrix J_d is never explicitly required.
In particular, substituting (6) into (1) gives

    ∇_d ℓ_H(w⋆) = −Bᵀ (C + ∇²_w ℓ_T(w⋆))⁻¹ ∇_w ℓ_H(w⋆),    (10)

so we observe that it suffices to solve the single linear system

    (C + ∇²_w ℓ_T(w⋆)) x = ∇_w ℓ_H(w⋆)    (11)

and then form ∇_d ℓ_H(w⋆) = −Bᵀx. By organizing the computations this way, the number of least-squares problems that must be solved is substantially reduced from k to only one. A similar trick was previously used for hyperparameter adaptation in SVMs [16] and kernel logistic regression [33]. Figure 1 shows a summary of our algorithm for hyperparameter gradient computation.¹

5 Experiments

To test the effectiveness of our hyperparameter learning algorithm, we applied it to two tasks: a simulated sequence labeling task involving noisy features, and a real-world application of conditional log-linear models to the biological problem of RNA secondary structure prediction.

Sequence labeling simulation. For our simulation test, we constructed a simple linear-chain hidden Markov model (HMM) with binary-valued hidden nodes, yᵢ ∈ {0, 1}.² We associated 40 binary-valued features xʲᵢ, j ∈ {1, …, 40}, with each hidden state yᵢ, including R "relevant" observed features whose values were chosen based on yᵢ, and (40 − R) "irrelevant" noise features whose values were chosen to be either 0 or 1 with equal probability, independent of yᵢ.³ Figure 2a shows the graphical model representing the HMM. For each run, we used the HMM to simulate training, holdout, and testing sets of M, 10, and 1000 sequences, respectively, each of length 10.

Next, we constructed a CRF based on an HMM model similar to that shown in Figure 2a, in which potentials were included for the initial node y₁, between each yᵢ and yᵢ₊₁, and between yᵢ and each xʲᵢ (including both the observed features and the noise features).
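The simulation setup above can be sketched as a small data generator. This is our own minimal reconstruction (the function name and array layout are hypothetical), using the probabilities stated in the footnotes: initial state probability 0.5, self-transition probability 0.6, and P(xʲᵢ = v | yᵢ = v) = 0.6 for the R relevant features.

```python
import numpy as np

def simulate_hmm(num_seqs, R, length=10, num_feats=40, seed=None):
    """Toy generator for the binary-chain HMM of Figure 2a.

    Y: (num_seqs, length) hidden labels in {0, 1}.
    X: (num_seqs, length, num_feats); the first R features agree with y_i with
    probability 0.6 ("relevant"), the rest are fair coin flips ("noise").
    """
    rng = np.random.default_rng(seed)
    Y = np.empty((num_seqs, length), dtype=int)
    Y[:, 0] = rng.integers(0, 2, size=num_seqs)           # initial state prob. 0.5 each
    for t in range(1, length):
        stay = rng.random(num_seqs) < 0.6                 # self-transition prob. 0.6
        Y[:, t] = np.where(stay, Y[:, t - 1], 1 - Y[:, t - 1])
    X = rng.integers(0, 2, size=(num_seqs, length, num_feats))  # noise features
    agree = rng.random((num_seqs, length, R)) < 0.6       # P(x = v | y = v) = 0.6
    X[:, :, :R] = np.where(agree, Y[:, :, None], 1 - Y[:, :, None])
    return X, Y
```

A run such as `simulate_hmm(M, R=5)` then plays the role of one training-set draw in the experiments below.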
We then performed gradient-based hyperparameter learning using three different parameter-tying schemes: (a) all hyperparameters constrained to be equal, (b) separate hyperparameter groups for each parameter of the model, and (c) transitions, observed features, and noise features each grouped together. Figure 2b shows the performance of the CRF for each of the three parameter-tying schemes under gradient-based optimization, as well as the performance of scheme (a) when using the standard grid-search strategy of trying regularization matrices CI for C ∈ {…, 2⁻², 2⁻¹, 2⁰, 2¹, 2², …}.

As seen in Figures 2b and 2c, the gradient-based procedure performed either as well as or better than a grid search for single-hyperparameter models. Using either a single hyperparameter or all separate hyperparameters generally gave similar results, with a slight tendency for the separate

¹In practice, roughly 50–100 iterations of CG were sufficient to obtain hyperparameter gradients, meaning that the cost of running Algorithm 1 was approximately the same as the cost of solving OPT1 for a single fixed setting of the hyperparameters.
Roughly 3–5 line searches were sufficient to identify good hyperparameter settings; assuming that each line search takes 2–4 times the cost of solving OPT1, the overall hyperparameter learning procedure takes approximately 20 times the cost of solving OPT1 once.

²For our HMM, we set initial state probabilities to 0.5 each, and used self-transition probabilities of 0.6.
³Specifically, we drew each xʲᵢ independently according to P(xʲᵢ = v | yᵢ = v) = 0.6, v ∈ {0, 1}.

[Figure 3: panel (a) shows an example RNA sequence (5′-uccguagaaggc-3′) and its secondary structure. Panel (b) is the following table of learned hyperparameters:

Regularization group                       exp(dᵢ), fold A   exp(dᵢ), fold B
hairpin loop lengths                       0.0832            0.456
helix closing base pairs                   0.780             0.0947
symmetric internal loop lengths            6.32              0.0151
external loop lengths                      0.338             0.401
bulge loop lengths                         0.451             2.03
base pairings                              2.01              7.95
internal loop asymmetry                    4.24              6.90
explicit internal loop sizes               12.8              6.39
terminal mismatch interactions             132               50.2
single base pair stacking interactions     71.0              104
1 × 1 internal loop nucleotides            139               120
single base bulge nucleotides              136               130
internal loop lengths                      1990              35.3
multi-branch loop lengths                  359               2750
helix stacking interactions                12100             729

Panel (c) plots sensitivity versus specificity for ILM, Mfold, ViennaRNA, PKNOTS, Pfold, and CONTRAfold (our algorithm), the latter with single (AUC = 0.6169, logloss = 5916), separate (AUC = 0.6383, logloss = 5763), and grouped (AUC = 0.6406, logloss = 5531) hyperparameters.]

Figure 3: RNA secondary structure prediction. (a) An illustration of the secondary structure prediction task. (b) Grouped hyperparameters learned using our algorithm for each of the two folds. 
(c) Performance comparison with state-of-the-art methods when using either a single hyperparameter (the "original" CONTRAfold), separate hyperparameters, or grouped hyperparameters.

hyperparameter model to overfit. Enforcing regularization groups, however, gave consistently lower error rates, achieving an absolute reduction in generalization error over the next-best model of 6.7%, corresponding to a relative reduction of 16.2%.

RNA secondary structure prediction. We also applied our framework to the problem of RNA secondary structure prediction. Ribonucleic acid (RNA) molecules are long nucleic acid polymers present in the cells of all living organisms. For many types of RNA, three-dimensional (or tertiary) structure plays an important role in determining the RNA's function. Here, we focus on the task of predicting RNA secondary structure, i.e., the pattern of nucleotide base pairings that form the two-dimensional scaffold upon which RNA tertiary structures assemble (see Figure 3a).

As a starting point, we used CONTRAfold [7], a current state-of-the-art secondary structure prediction program based on CLLMs. In brief, the CONTRAfold program models RNA secondary structures using a variant of stochastic context-free grammars (SCFGs) that incorporates features chosen to closely match the energetic terms found in standard physics-based models of RNA structure. These features model the various types of loops that occur in RNAs (e.g., hairpin loops, bulge loops, interior loops, etc.). To control overfitting, CONTRAfold uses flat L2 regularization.
Here, we modified the existing implementation to perform an "outer" optimization loop based on our algorithm, and chose regularization groups either by (a) enforcing a single hyperparameter group, (b) using separate groups for each parameter, or (c) grouping according to the type of each feature (e.g., all features describing hairpin loop lengths were placed in a single regularization group).

For testing, we collected 151 RNA sequences from the Rfam database [13] for which experimentally determined secondary structures were already known. We divided this dataset into two folds (denoted A and B) and performed two-fold cross-validation. Despite the small size of the training set, the hyperparameters learned on each fold were nonetheless qualitatively similar, indicating the robustness of the procedure (see Figure 3b). As expected, features with small regularization hyperparameters correspond to properties of RNAs that are known to contribute strongly to the energetics of RNA secondary structure, whereas many of the features with larger regularization hyperparameters indicate structural properties whose presence or absence is either less correlated with RNA secondary structure or sufficiently noisy that their parameters are difficult to determine reliably from the training data.

We then compared the cross-validated performance of our algorithm with state-of-the-art methods (see Figure 3c).⁴ Using separate or grouped hyperparameters both gave increased sensitivity and increased specificity compared to the original model, which was learned using a single regularization hyperparameter.
Overall, the testing logloss (summed over the two folds) decreased by roughly 6.5% when using grouped hyperparameters and 2.6% when using multiple separate hyperparameters, while the estimated testing ROC area increased by roughly 3.8% and 3.4%, respectively.

6 Discussion and related work

In this work, we presented a gradient-based approach for hyperparameter learning based on minimizing logloss on a holdout set. While the use of cross-validation loss as a proxy for generalization error is fairly natural, in many other supervised learning methods besides log-linear models, other objective functions have been proposed for hyperparameter optimization. In SVMs, approaches based on optimizing generalization bounds [4], such as the radius/margin bound [15] or the maximal discrepancy criterion [2], have been proposed. Comparable generalization bounds are not generally known for CRFs; even in SVMs, however, generalization-bound-based methods empirically do not outperform simpler methods based on optimizing five-fold cross-validation error [8].

A different method for dealing with hyperparameters, common in neural network modeling, is the Bayesian approach of treating hyperparameters themselves as parameters in the model to be estimated. In an ideal Bayesian scheme, one does not perform hyperparameter or parameter inference, but rather integrates over all possible hyperparameters and parameters in order to obtain a posterior distribution over predicted outputs given the training data. This integration can be performed using a hybrid Monte Carlo strategy [27, 38].
For the types of large-scale log-linear models we consider in this paper, however, the computational expense of sampling-based strategies can be extremely high due to slow convergence of MCMC techniques [26].

Empirical Bayesian (i.e., ML-II) strategies, such as Automatic Relevance Determination (ARD) [22], take the intermediate approach of integrating over parameters to obtain the marginal likelihood (known as the evidence), which is then optimized with respect to the hyperparameters. Computing marginal likelihoods, however, can be quite costly, especially for log-linear models. One method for doing this involves approximating the parameter posterior distribution as a Gaussian centered at the posterior mode [22, 37]. In this strategy, however, the "Occam factor" used for hyperparameter optimization still requires a Hessian computation, which does not scale well for log-linear models. An alternative approach based on a modification of expectation propagation (EP) [25] was applied in the context of Bayesian CRFs [32] and later extended to graph-based semi-supervised learning [14]. As described, however, inference in these models relies on non-traditional "probit-style" potentials for efficiency reasons, and known algorithms for inference in Bayesian CRFs are limited to graphical models with fixed structure.

In contrast, our approach works broadly for a variety of log-linear models, including the grammar-based models common in computational biology and natural language processing. Furthermore, our algorithm is simple and efficient, both conceptually and in practice: one iteratively optimizes the parameters of a log-linear model using a fixed setting of the hyperparameters, and then one changes the hyperparameters based on the holdout logloss gradient.
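This alternating procedure can be driven by a generic off-the-shelf optimizer. The following is a minimal end-to-end sketch, not the authors' code: quadratic stand-ins replace the training and holdout loglosses of a real CLLM, the inner OPT1 solve is a direct linear solve (a real implementation would use CG as in Algorithm 1), and the hyperparameter gradient follows equations (10) and (11).

```python
import numpy as np
from scipy.optimize import minimize

# Toy quadratic stand-ins for the training and holdout loglosses
# (a real CLLM would compute these quantities by dynamic programming).
rng = np.random.default_rng(1)
n, k = 8, 2
pi = np.repeat(np.arange(k), n // k)          # group mapping: first half group 0, etc.
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)                   # Hessian of the toy training logloss
b = rng.normal(size=n)
w_ho = rng.normal(size=n)                     # optimum of the toy holdout loss

def objective_and_grad(d):
    """Holdout loss l_H(w*(d)) and its gradient via implicit differentiation."""
    c = np.exp(d[pi])                         # diagonal of C(d)
    w_star = np.linalg.solve(np.diag(c) + A, b)        # inner solve (OPT1)
    g_H = w_star - w_ho                       # grad of l_H(w) = 0.5 ||w - w_ho||^2
    x = np.linalg.solve(np.diag(c) + A, g_H)           # eq. (11): (C + H) x = grad l_H
    grad_d = -np.bincount(pi, weights=w_star * c * x, minlength=k)   # eq. (10): -B^T x
    return 0.5 * np.sum(g_H ** 2), grad_d

# Outer loop: standard gradient-based optimization over the hyperparameters d.
res = minimize(objective_and_grad, np.zeros(k), jac=True, method="L-BFGS-B")
```

Each outer iteration re-solves the inner problem at the current d, matching the doubly nested structure described above.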
The gradient computation relies primarily on a simple conjugate gradient solver for linear systems, coupled with the ability to compute Hessian-vector products (straightforward in any modern programming language that allows for operator overloading). As we demonstrated in the context of RNA secondary structure prediction, gradient-based hyperparameter learning is a practical and effective method for tuning hyperparameters when applied to large-scale log-linear models.

Finally, we note that for neural networks, [9] and [5] proposed techniques for simultaneous optimization of hyperparameters and parameters; these results suggest that similar procedures for faster hyperparameter learning that do not require a doubly nested optimization may be possible.

References

[1] L. Andersen, J. Larsen, L. Hansen, and M. Hintz-Madsen. Adaptive regularization of neural classifiers. In NNSP, 1997.
[2] D. Anguita, S. Ridella, F. Rivieccio, and R. Zunino. Hyperparameter design criteria for support vector classifiers. Neurocomputing, 55:109–134, 2003.

⁴Following [7], we used the maximum expected accuracy algorithm for decoding, which returns a set of candidate parses reflecting different trade-offs between sensitivity (proportion of true base-pairs called) and specificity (proportion of called base-pairs which are correct).

[3] Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12:1889–1900, 2000.
[4] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3):131–159, 2002.
[5] D. Chen and M. Hagan. Optimal use of regularization and cross-validation in neural network modeling. In IJCNN, 1999.
[6] C. B. Do, S. S. Gross, and S. Batzoglou. CONTRAlign: discriminative training for protein sequence alignment. In RECOMB, pages 160–174, 2006.
[7] C. B. Do, D. A. Woods, and S. Batzoglou. 
CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):e90–e98, 2006.
[8] K. Duan, S. S. Keerthi, and A. N. Poo. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51(4):41–59, 2003.
[9] R. Eigenmann and J. A. Nossek. Gradient based adaptive regularization. In NNSP, pages 87–94, 1999.
[10] T. Glasmachers and C. Igel. Gradient-based adaptation of general Gaussian kernels. Neural Computation, 17(10):2099–2105, 2005.
[11] A. Globerson, T. Y. Koo, X. Carreras, and M. Collins. Exponentiated gradient algorithms for log-linear structured prediction. In ICML, pages 305–312, 2007.
[12] C. Goutte and J. Larsen. Adaptive regularization of neural networks using conjugate gradient. In ICASSP, 1998.
[13] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33:D121–D124, 2005.
[14] A. Kapoor, Y. Qi, H. Ahn, and R. W. Picard. Hyperparameter and kernel learning for graph based semi-supervised classification. In NIPS, pages 627–634, 2006.
[15] S. S. Keerthi. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13(5):1225–1229, 2002.
[16] S. S. Keerthi, V. Sindhwani, and O. Chapelle. An efficient method for gradient-based adaptation of hyperparameters in SVM models. In NIPS, 2007.
[17] K. Kobayashi, D. Kitakoshi, and R. Nakano. Yet faster method to optimize SVR hyperparameters based on minimizing cross-validation error. In IJCNN, volume 2, pages 871–876, 2005.
[18] K. Kobayashi and R. Nakano. Faster optimization of SVR hyperparameters based on minimizing cross-validation error. In IEEE Conference on Cybernetics and Intelligent Systems, 2004.
[19] J. Lafferty, A. McCallum, and F. Pereira.
Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML 18, pages 282–289, 2001.
[20] J. Larsen, L. K. Hansen, C. Svarer, and M. Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In NNSP, 1996.
[21] J. Larsen, C. Svarer, L. N. Andersen, and L. K. Hansen. Adaptive regularization in neural network modeling. In Neural Networks: Tricks of the Trade, pages 113–132, 1996.
[22] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
[23] D. J. C. MacKay and R. Takeuchi. Interpolation models with multiple hyperparameters. Statistics and Computing, 8:15–23, 1998.
[24] J. R. R. A. Martins, P. Sturdza, and J. J. Alonso. The complex-step derivative approximation. ACM Trans. Math. Softw., 29(3):245–262, 2003.
[25] T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, volume 17, pages 362–369, 2001.
[26] I. Murray and Z. Ghahramani. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In UAI, pages 392–399, 2004.
[27] R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
[28] A. Y. Ng. Preventing overfitting of cross-validation data. In ICML, pages 245–253, 1997.
[29] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.
[30] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
[31] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
[32] Y. Qi, M. Szummer, and T. P. Minka. Bayesian conditional random fields. In AISTATS, 2005.
[33] M. Seeger. Cross-validation optimization for large scale hierarchical classification kernel methods. In NIPS, 2007.
[34] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL, pages 134–141, 2003.
[35] S. Sundararajan and S. S.
Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13(5):1103–1118, 2001.
[36] S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In ICML, pages 969–976, 2006.
[37] M. Welling and S. Parise. Bayesian random fields: the Bethe-Laplace approximation. In ICML, 2006.
[38] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
[39] X. Zhang and W. S. Lee. Hyperparameter learning for graph based semi-supervised learning algorithms. In NIPS, 2007.