{"title": "Computing regularization paths for learning multiple kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 73, "page_last": 80, "abstract": null, "full_text": " Computing regularization paths\n for learning multiple kernels\n\n\n\n Francis R. Bach & Romain Thibaux Michael I. Jordan\n Computer Science Computer Science and Statistics\n University of California University of California\n Berkeley, CA 94720 Berkeley, CA 94720\n {fbach,thibaux}@cs.berkeley.edu jordan@cs.berkeley.edu\n\n\n\n Abstract\n\n The problem of learning a sparse conic combination of kernel functions\n or kernel matrices for classification or regression can be achieved via the\n regularization by a block 1-norm [1]. In this paper, we present an al-\n gorithm that computes the entire regularization path for these problems.\n The path is obtained by using numerical continuation techniques, and\n involves a running time complexity that is a constant times the complex-\n ity of solving the problem for one value of the regularization parameter.\n Working in the setting of kernel linear regression and kernel logistic re-\n gression, we show empirically that the effect of the block 1-norm reg-\n ularization differs notably from the (non-block) 1-norm regularization\n commonly used for variable selection, and that the regularization path is\n of particular value in the block case.\n\n\n\n1 Introduction\n\nKernel methods provide efficient tools for nonlinear learning problems such as classifica-\ntion or regression. Given a learning problem, two major tasks faced by practitioners are to\nfind an appropriate kernel and to understand how regularization affects the solution and its\nperformance. 
This paper addresses both of these issues within the supervised learning setting by combining three themes from recent statistical machine learning research, namely multiple kernel learning [2, 3, 1], computation of regularization paths [4, 5], and the use of path following methods [6].

The problem of learning the kernel from data has recently received substantial attention, and several formulations have been proposed that involve optimization over the conic structure of the space of kernels [2, 1, 3]. In this paper we follow the specific formulation of [1], who showed that learning a conic combination of basis kernels is equivalent to regularizing the original supervised learning problem by a weighted block 1-norm (see Section 2.2 for further details). Thus, by solving a single convex optimization problem, the coefficients of the conic combination of kernels and the values of the parameters (the dual variables) are obtained. Given the basis kernels and their coefficients, there is one free parameter remaining: the regularization parameter.

Kernel methods are nonparametric methods, and thus regularization plays a crucial role in their behavior. In order to understand a nonparametric method, in particular complex nonparametric methods such as those considered in this paper, it is useful to be able to consider the entire path of regularization, that is, the set of solutions for all values of the regularization parameter [7, 4]. Moreover, if it is relatively cheap computationally to compute this path, then it may be of practical value to compute the path as standard practice in fitting a model. This would seem particularly advisable in cases in which performance can display local minima along the regularization path.
In such cases, standard local search methods may yield unnecessarily poor performance.

For least-squares regression with a 1-norm penalty or for the support vector machine, there exist efficient computational techniques to explore the regularization path [4, 5]. These techniques exploit the fact that for these problems the path is piecewise linear. In this paper we consider the extension of these techniques to the multiple kernel learning problem. As we will show (in Section 3), in this setting the path is no longer piecewise linear. It is, however, piecewise smooth, and we are able to follow it by using numerical continuation techniques [8, 6]. To do this in a computationally efficient way, we invoke logarithmic barrier techniques analogous to those used in interior point methods for convex optimization (see Section 3.3). As we shall see, the complexity of our algorithms essentially depends on the number of "kinks" in the path, i.e., the number of discontinuity points of the derivative. Our experiments suggest that the number of those kinks is always less than a small constant times the number of basis kernels. The empirical complexity of our algorithm is thus a constant times the complexity of solving the problem using interior point methods for one value of the regularization parameter (see Section 3.4 for details).

In Section 4, we present simulation experiments for classification and regression problems, using a large set of basis kernels based on the most widely used kernels (linear, polynomial, Gaussian). In particular, we show empirically that the number of kernels in the conic combination is not a monotonic function of the amount of regularization. This contrasts with the simpler non-block 1-norm case for variable selection (i.e., blocks of size one [4]), where the number of variables is usually monotonic (or nearly so).
Thus the need to compute full regularization paths is particularly acute in our more complex (block 1-norm regularization) case.

2 Block 1-norm regularization

In this section we review the block 1-norm regularization framework of [1] as it applies to differentiable loss functions. To provide the necessary background, we begin with a short review of classical 2-norm regularization.

2.1 Classical 2-norm regularization

Primal formulation   We consider the general regularized learning optimization problem [7], where the data x_i, i = 1, ..., n, belong to the input space X, and the y_i, i = 1, ..., n, are the responses (lying in {-1, 1} for classification or in R for regression). We map the data into a feature space F through x -> Φ(x). The kernel associated with this feature map is denoted k(x, y) = Φ(x)^T Φ(y). The optimization problem is the following¹:

    min_{w ∈ F}  Σ_{i=1..n} ℓ(y_i, w^T Φ(x_i)) + (λ/2) ||w||²,        (1)

where λ > 0 is a regularization parameter and ||w|| is the 2-norm of w, defined as ||w|| = (w^T w)^{1/2}. The loss function ℓ is any function from R × R to R. In this paper, we focus on loss functions that are strictly convex and twice continuously differentiable in their second argument. Let ψ_i(v), v ∈ R, be the Fenchel conjugate [9] of the convex function φ_i(u) = ℓ(y_i, u), defined as ψ_i(v) = max_{u ∈ R} (vu - φ_i(u)). Since we have assumed that φ is strictly convex and differentiable, the maximum defining ψ_i(v) is attained at a unique point equal to ψ_i'(v) (possibly equal to +∞ or -∞).

    ¹We omit the intercept, as it can be included by adding a constant variable equal to 1 to each feature vector Φ(x_i).
The function ψ_i(v) is then strictly convex and twice differentiable in its domain.

In particular, we have the following examples in mind: for least-squares regression, we have φ_i(u) = (1/2)(y_i - u)² and ψ_i(v) = (1/2)v² + v y_i, while for logistic regression, we have φ_i(u) = log(1 + exp(-y_i u)), where y_i ∈ {-1, 1}, and ψ_i(v) = (1 + v y_i) log(1 + v y_i) - v y_i log(-v y_i) if v y_i ∈ (-1, 0), and +∞ otherwise.

Dual formulation and optimality conditions   The Lagrangian for problem (1) is

    L(w, u, α) = Σ_i φ_i(u_i) + (λ/2) ||w||² - λ Σ_i α_i (u_i - w^T Φ(x_i))

and is minimized with respect to u and w with w = -Σ_i α_i Φ(x_i). The dual problem is then

    max_{α ∈ R^n}  -Σ_i ψ_i(λα_i) - (λ/2) α^T K α,        (2)

where K ∈ R^{n×n} is the kernel matrix of the points, i.e., K_ab = k(x_a, x_b). The optimality condition for the dual variable α is then:

    ∀i,  (Kα)_i + ψ_i'(λα_i) = 0.        (3)

2.2 Block 1-norm regularization

In this paper, we map the input space X to m different feature spaces F_1, ..., F_m, through m feature maps Φ_1(x), ..., Φ_m(x). We now have m different variables w_j ∈ F_j, j = 1, ..., m. We use the notation Φ(x) = (Φ_1(x), ..., Φ_m(x)) and w = (w_1, ..., w_m), and from now on we use the implicit convention that the index i ranges over data points (from 1 to n), while the index j ranges over kernels/feature spaces (from 1 to m).

Let d_j, j = 1, ..., m, be weights associated with each kernel. We will see in Section 4 how these should be linked to the rank of the kernel matrices. Following [1], we consider the following problem with weighted block 1-norm regularization² (where ||w_j|| = (w_j^T w_j)^{1/2} still denotes the 2-norm of w_j):

    min_{w ∈ F_1 × ... × F_m}  Σ_i φ_i(w^T Φ(x_i)) + λ Σ_j d_j ||w_j||.        (4)

The problem (4) is a convex problem, but not a differentiable one.
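As an aside (not part of the original paper), the two conjugate pairs given above can be checked numerically by brute-force maximization of vu - φ_i(u) over a grid; a minimal sketch, with arbitrary grid bounds and tolerances:

```python
import math

# Losses phi_i(u) = loss(y_i, u) and their closed-form Fenchel conjugates
# psi_i(v) = max_u (v*u - phi_i(u)), as given in the text.

def phi_ls(u, y):                 # least-squares loss
    return 0.5 * (y - u) ** 2

def psi_ls(v, y):                 # conjugate: v^2/2 + v*y
    return 0.5 * v ** 2 + v * y

def phi_logistic(u, y):           # logistic loss, y in {-1, +1}
    return math.log(1.0 + math.exp(-y * u))

def psi_logistic(v, y):           # conjugate, finite only for v*y in (-1, 0)
    t = v * y
    if not -1.0 < t < 0.0:
        return float("inf")
    return (1.0 + t) * math.log(1.0 + t) - t * math.log(-t)

def psi_numeric(phi, v, y, lo=-30.0, hi=30.0, steps=200001):
    # brute-force max_u (v*u - phi(u, y)) on a uniform grid
    return max(v * u - phi(u, y)
               for k in range(steps)
               for u in [lo + (hi - lo) * k / (steps - 1)])

for y in (-1.0, 1.0):
    v = -0.3 * y                  # inside the logistic domain v*y in (-1, 0)
    assert abs(psi_ls(v, y) - psi_numeric(phi_ls, v, y)) < 1e-4
    assert abs(psi_logistic(v, y) - psi_numeric(phi_logistic, v, y)) < 1e-4
print("conjugate pairs verified")
```

Any v with v*y_i in (-1, 0) works for both losses; outside that interval the logistic conjugate is +∞.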
In order to derive optimality conditions, we can reformulate it with conic constraints and derive the following dual problem (we omit details for brevity) [9, 1]:

    max_α  -Σ_i ψ_i(λα_i)   such that   ∀j,  α^T K_j α ≤ d_j²,        (5)

where K_j is the kernel matrix associated with kernel k_j, i.e., defined as (K_j)_ab = k_j(x_a, x_b). From the KKT conditions for problem Eq. (5), we obtain that the dual variable α is optimal if and only if there exists η ∈ R^m such that η ≥ 0 and

    ∀i,  (Σ_j η_j K_j α)_i + ψ_i'(λα_i) = 0        (6)
    ∀j,  α^T K_j α ≤ d_j²,  η_j ≥ 0,  η_j (d_j² - α^T K_j α) = 0.

We can go back and forth between optimal w and α by w_j = -η_j Σ_i α_i Φ_j(x_i) or α_i = (1/λ) φ_i'(w^T Φ(x_i)).

We see that the solution of Eq. (5) can be obtained by using only the kernel matrices K_j (i.e., this is indeed a kernel machine) and that the optimal solution of the block 1-norm problem in Eq. (5), with optimality conditions in Eq. (6), is the solution of the regular 2-norm problem in Eq. (2) with kernel K = Σ_j η_j K_j. Thus, with this formulation, we learn the coefficients of the conic combination of kernels as well as the dual variables [1]. As shown in [1], the conic combination is sparse, i.e., many of the coefficients η_j are equal to zero.

    ²In [1], the square of the block 1-norm was used. However, when the entire regularization path is sought, it is easy to show that the two problems are equivalent. The advantage of the current formulation is that when the blocks are of size one the problem reduces to classical 1-norm regularization [4].

[Figure 1: (Left) Geometric interpretation of the dual problem in Eq. (5) for linear regression; see text for details. (Right) Predictor-corrector algorithm.]

2.3 Geometric interpretation of dual problem

Each function ψ_i is strictly convex, with a strict minimum at ᾱ_i defined by ψ_i'(ᾱ_i) = 0 (for least-squares regression we have ᾱ_i = -y_i, and for logistic regression we have ᾱ_i = -y_i/2).
The negated dual objective Σ_i ψ_i(λα_i) is thus a metric between α and ᾱ/λ (for least-squares regression, this is simply the squared distance, while for logistic regression, this is an entropy distance). Therefore, the dual problem aims to minimize a metric between α and the target ᾱ/λ, under the constraint that α belongs to an intersection of m ellipsoids {α ∈ R^n : α^T K_j α ≤ d_j²}.

When computing the regularization path from λ = +∞ to λ = 0, the target ᾱ/λ goes from 0 to ∞ in the direction ᾱ (see Figure 1). The geometric interpretation immediately implies that as long as (1/λ²) ᾱ^T K_j ᾱ ≤ d_j² for all j, the active set is empty, the optimal α is equal to ᾱ/λ and the optimal w is equal to 0. We thus initialize the path following technique with λ = max_j (ᾱ^T K_j ᾱ / d_j²)^{1/2} and α = ᾱ/λ.

3 Building the regularization path

In this section, the goal is to vary λ from +∞ (full regularization) to 0 (no regularization) and obtain a representation of the path of solutions (α(λ), η(λ)). We will essentially approximate the path by a piecewise linear function of ν = log(λ).

3.1 Active set method

For the dual formulation Eq. (5)-Eq. (6), if the set of active kernels J(λ) is known, i.e., the set of kernels j such that α^T K_j α = d_j², then the optimality conditions become

    ∀j ∈ J,  α^T K_j α = d_j²        (7)
    ∀i,  (Σ_{j ∈ J} η_j K_j α)_i + ψ_i'(λα_i) = 0

and they are valid as long as ∀j ∉ J, α^T K_j α ≤ d_j² and ∀j ∈ J, η_j ≥ 0.

The path is thus piecewise smooth, with "kinks" at each point where the active set J changes. On each of the smooth sections, only those kernels with index belonging to J are used to define α and η, through Eq. (7). When all blocks have size one, or equivalently when all kernel matrices have rank one, the path is provably linear in 1/λ between each kink [4] and is thus easy to follow. However, when the kernel matrices have higher rank, this is not the case and additional numerical techniques are needed, which we now present.
In the regularized formulation we present in Section 3.3, the optimal η is a function of α, and therefore we only have to follow the optimal α as a function of ν = log(λ).

3.2 Following a smooth path using numerical continuation techniques

In this section, we provide a brief review of path following, focusing on predictor-corrector methods [8]. We assume that the function β(λ) ∈ R^d is defined implicitly by J(β, λ) = 0, where J is C^∞ from R^{d+1} to R^d and λ is a real variable. Starting from a point (β_0, λ_0) such that J(β_0, λ_0) = 0, by the implicit function theorem, the solution β(λ) is well defined and C^∞ if the differential ∂J/∂β ∈ R^{d×d} is invertible. The derivative at λ_0 is then equal to

    dβ/dλ(λ_0) = -(∂J/∂β(β_0, λ_0))^{-1} ∂J/∂λ(β_0, λ_0).

In order to follow the curve β(λ), the most effective numerical method is the predictor-corrector method, which works as follows (see Figure 1):

predictor step: from (β_0, λ_0), predict where β(λ_0 + h) should be using the first-order expansion, i.e., take λ_1 = λ_0 + h and β_1 = β_0 + h dβ/dλ(λ_0) (note that h can be chosen positive or negative, depending on the direction we want to follow).

corrector steps: (β_1, λ_1) might not satisfy J(β_1, λ_1) = 0, i.e., the tangent prediction might (and generally will) leave the curve β(λ). In order to return to the curve, Newton's method is used to solve the nonlinear system of equations (in β) J(β, λ_1) = 0, starting from β = β_1. If h is small enough, then the Newton steps converge quadratically to a solution β_2 of J(β, λ_1) = 0 [8].

Methods that perform only one of the two steps are not as efficient: doing only predictor steps is not stable and the algorithm leaves the path very quickly, whereas doing only corrector steps (with increasing λ) is essentially equivalent to seeding the optimizer for a given λ with the solution for a previous λ, which is very inefficient in sections where the path is close to linear.
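As a small illustration (ours, not from the paper), the two steps can be run on a toy scalar curve β(λ) defined by J(β, λ) = β³ + β - λ = 0, whose differential ∂J/∂β = 3β² + 1 is always invertible; the step size h and tolerance below are arbitrary choices:

```python
# Minimal predictor-corrector continuation on a toy implicit curve
# J(beta, lam) = beta^3 + beta - lam = 0; dJ/dbeta = 3*beta^2 + 1 > 0,
# so the path beta(lam) is globally well defined.

def J(beta, lam):
    return beta ** 3 + beta - lam

def dJ_dbeta(beta):
    return 3.0 * beta ** 2 + 1.0

def dJ_dlam():
    return -1.0

def follow_path(beta0, lam0, lam_end, h=0.1, tol=1e-12):
    beta, lam = beta0, lam0
    path = [(lam, beta)]
    while lam < lam_end:
        # predictor: first-order expansion along the path
        slope = -dJ_dlam() / dJ_dbeta(beta)          # dbeta/dlam
        step = min(h, lam_end - lam)
        lam += step
        beta += step * slope
        # corrector: Newton's method on beta for the new, fixed lam
        while abs(J(beta, lam)) > tol:
            beta -= J(beta, lam) / dJ_dbeta(beta)
        path.append((lam, beta))
    return path

path = follow_path(beta0=0.0, lam0=0.0, lam_end=2.0)
# beta(2) solves beta^3 + beta = 2, whose unique real root is beta = 1
assert abs(path[-1][1] - 1.0) < 1e-9
```

Joining the returned points gives exactly the piecewise linear approximation of the path described in the text.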
Predictor-corrector methods approximate the path by a sequence of points on that path, which can be joined to provide a piecewise linear approximation.

At first glance, in order to follow the piecewise smooth path, all that is needed is to follow each piece and detect when the active set changes, i.e., when α^T K_j α = d_j² for some j ∉ J, or η_j = 0 for some j ∈ J. However, this approach can be tricky numerically [8]. We instead prefer to use a numerical regularization technique that will (a) make the entire path smooth, (b) make sure that the Newton steps are globally convergent, and (c) still enable us to use only a subset of the kernels to define the path locally.

3.3 Numerical regularization

We borrow a classical regularization method from interior point methods, in which a constrained problem is made unconstrained by using a convex log-barrier [9]. In the dual formulation, we solve the following problem (note that we now use a min-problem and we have divided by λ, which leaves the problem unchanged), where ε is a fixed small constant:

    min_α F_ε(α, λ)   where   F_ε(α, λ) = (1/λ) Σ_i ψ_i(λα_i) - (ε/2) Σ_j log(d_j² - α^T K_j α).        (8)

For λ fixed, F_ε(α, λ) is C^∞ and strictly convex in its domain {α : ∀j, α^T K_j α < d_j²}, and thus the global minimum is uniquely defined by ∂F_ε/∂α = 0. If we define η_j(α) = ε / (d_j² - α^T K_j α), then we have ∂F_ε/∂α_i = ψ_i'(λα_i) + Σ_j η_j(α) (K_j α)_i, and thus the optimality condition for the problem with the log-barrier is exactly equivalent to the one in Eq. (6). But now, instead of having η_j (d_j² - α^T K_j α) = 0 (which would define an optimal solution of the numerically unregularized problem), we have η_j (d_j² - α^T K_j α) = ε.
[Figure 2: Examples of the variation of the kernel weights η along the regularization path (horizontal axis: -log(λ)) for linear regression (left) and logistic regression (right).]

Any dual-feasible variables α and η (not necessarily linked through a functional relationship) define primal-dual variables, and the quantity Σ_j η_j (d_j² - α^T K_j α) is exactly the duality gap [9], i.e., the difference between the primal and dual objectives. Thus the parameter ε holds fixed the duality gap we are willing to pay. In simulations, we used ε = 10⁻³.

We can apply the techniques of Section 3.2 to follow the path for a fixed ε, for the variables α only, since η is now a function of α. The corrector steps are not only Newton steps for solving a system of nonlinear equations; they are also Newton-Raphson steps to minimize a strictly convex function, and are thus globally convergent [9].

3.4 Path following algorithm

Our path following algorithm is simply a succession of predictor-corrector steps, described in Section 3.2, with J(α, ν) = ∂F_ε/∂α(α, λ), where ν = log(λ). The initialization presented in Section 2.3 is used.

In Figure 2, we show simple examples of the values of the kernel weights η along the path for a toy problem with a small number of kernels, for kernel linear regression and kernel logistic regression. It is worth noting that the weights η are not even approximately monotonic functions of λ; also, the behavior of those weights as λ approaches zero (i.e., as -log(λ) grows unbounded) is very specific: they become constant for linear regression and they grow to infinity for logistic regression. In Section 4, we show (a) why these behaviors occur and (b) what the consequences are regarding the performance of the multiple kernel learning problem.
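To make Eq. (8) concrete, here is a small self-contained sketch (ours, not the released Matlab code) for a two-point least-squares problem with two kernels: at each λ on a crude geometric grid we minimize F_ε by damped Newton, warm-starting from the previous solution (the corrector-only scheme mentioned above), and read off the weights η_j(α). The data, kernels, d_j and grid constants are all made up for illustration:

```python
import math

eps = 1e-3                                   # barrier constant (duality gap paid)
y = [1.0, -1.0]                              # two-point least-squares problem
Ks = [[[1.0, 0.3], [0.3, 1.0]],              # two toy kernel matrices K_j
      [[0.6, -0.2], [-0.2, 0.8]]]
d2 = [1.0, 1.0]                              # the d_j^2

def quad(K, a):                              # alpha' K alpha
    return sum(a[i] * K[i][j] * a[j] for i in range(2) for j in range(2))

def matvec(K, a):
    return [K[i][0] * a[0] + K[i][1] * a[1] for i in range(2)]

def etas(a):                                 # eta_j(alpha) = eps / (d_j^2 - a'K_j a)
    return [eps / (d2[j] - quad(Ks[j], a)) for j in range(2)]

def feasible(a):
    return all(quad(Ks[j], a) < d2[j] for j in range(2))

def F(a, lam):                               # barrier objective of Eq. (8), LS case
    data = sum(0.5 * lam * a[i] ** 2 + a[i] * y[i] for i in range(2))
    return data - 0.5 * eps * sum(math.log(d2[j] - quad(Ks[j], a)) for j in range(2))

def grad(a, lam):                            # psi'(lam a_i) + sum_j eta_j (K_j a)_i
    e, g = etas(a), [lam * a[i] + y[i] for i in range(2)]
    for j in range(2):
        Ka = matvec(Ks[j], a)
        for i in range(2):
            g[i] += e[j] * Ka[i]
    return g

def hess(a, lam):                            # Hessian: lam*I + barrier curvature
    e = etas(a)
    H = [[lam if i == k else 0.0 for k in range(2)] for i in range(2)]
    for j in range(2):
        Ka, c = matvec(Ks[j], a), 2.0 * e[j] / (d2[j] - quad(Ks[j], a))
        for i in range(2):
            for k in range(2):
                H[i][k] += e[j] * Ks[j][i][k] + c * Ka[i] * Ka[k]
    return H

def corrector(a, lam):                       # damped Newton on the strictly convex F
    for _ in range(200):
        g = grad(a, lam)
        if max(abs(x) for x in g) < 1e-10:
            break
        H = hess(a, lam)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        s = [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
             (H[0][0] * g[1] - H[1][0] * g[0]) / det]
        t = 1.0
        new = [a[0] - t * s[0], a[1] - t * s[1]]
        while not feasible(new) or F(new, lam) > F(a, lam) + 1e-12:
            t *= 0.5                          # stay strictly feasible, decrease F
            new = [a[0] - t * s[0], a[1] - t * s[1]]
        a = new
    return a

abar = [-y[0], -y[1]]                        # psi_i'(abar_i) = 0 (LS: abar = -y)
lam0 = 1.05 * max(quad(Ks[j], abar) / d2[j] for j in range(2)) ** 0.5
lams = [lam0 * 0.7 ** k for k in range(20)]
a = [abar[i] / lam0 for i in range(2)]       # strictly feasible warm start
for lam in lams:
    a = corrector(a, lam)
print("weights eta at the smallest lambda:", etas(a))
```

Replacing the least-squares terms in F and grad by the logistic ψ of Section 2.1 gives the corresponding corrector for classification; the full algorithm of Section 3.4 adds the predictor step on top of this.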
In the remainder of this section, we review some important algorithmic issues³.

Step size selection   A major issue in path following methods is the choice of the step h: if h is too big, the predictor will end up very far from the path and many Newton steps have to be performed, while if h is too small, progress is too slow. We chose a simple adaptive scheme where at each predictor step we select the biggest h such that the predictor step stays in the domain and the residual |J(α, ν)| remains below a precision threshold. This precision threshold is itself adapted at each iteration: if the number of corrector steps at the previous iteration is greater than 8, the threshold is decreased, whereas if this number is less than 4, it is increased.

Running time complexity   Between each pair of kinks, the path is smooth, and thus a bounded number of steps is required [8, 9]. Each of those steps has complexity O(n³ + mn²). We have observed empirically that the overall number of those steps is O(m); the total empirical complexity is thus O(mn³ + m²n²). The complexity of solving the optimization problem in Eq. (5) using an interior point method for only one value of the regularization parameter is O(mn³) [2]; thus, if m ≤ n, the empirical complexity of our algorithm, which yields the entire regularization path, is a constant times the complexity of obtaining only one point on the path using an interior point method.
This makes intuitive sense, as both methods follow a path: by varying ε in the case of the interior point method, and by varying λ in our case. The difference is that every point along our path is meaningful, not just the destination.

    ³A Matlab implementation can be downloaded from www.cs.berkeley.edu/~fbach .

[Figure 3: Varying the weights d_j: (left) classification on the Liver dataset, (right) regression on the Boston dataset; for each dataset, two values of γ, γ = 0 and γ = 1. (Top) training error in bold and testing error in dashed (error rate for classification, mean squared error for regression), as functions of -log(λ); (bottom) number of kernels in the conic combination.]

Efficient implementation   Because of our numerical regularization, none of the η_j's are equal to zero (in fact, each η_j is lower bounded by ε/d_j²). We would thus have to use all kernels when computing the various derivatives. We circumvent this by truncating to zero those η_j that are close to their lower bound: we thus only use the kernels that are numerically present in the combination.

Second-order predictor step   The implicit function theorem also allows us to compute higher-order derivatives of the path.
By using a second-order approximation of the path, we can significantly reduce the number of predictor-corrector steps required to follow the path.

4 Simulations

We have performed simulations on the Boston dataset (regression, 13 variables, 506 data points) and the Liver dataset (classification, 6 variables, 345 data points) from the UCI repository, with the following kernels: a linear kernel on all variables, linear kernels on single variables, polynomial kernels (with 4 different orders), Gaussian kernels on all variables (with 7 different kernel widths), Gaussian kernels on subsets of variables (also with 7 different kernel widths), and the identity matrix. This makes 110 kernels for the Boston dataset and 54 for the Liver dataset. All kernel matrices were normalized to unit trace.

Intuitively, the regularization weight d_j for kernel K_j should be an increasing function of the rank of K_j, i.e., we should penalize more heavily feature spaces of higher dimension. In order to explore the effect of d_j on performance, we set d_j as follows: we compute the number p_j of eigenvalues of K_j that are greater than 1/(2n) (remember that because of the unit trace constraint, these n eigenvalues sum to 1), and we take d_j = p_j^γ. If γ = 0, then all the d_j's are equal to one, and when γ increases, kernel matrices of high rank, such as the identity matrix, have relatively higher weights, noting that a higher weight implies a heavier regularization.

In Figure 3, for the Boston and Liver datasets, we plot the number of kernels in the conic combination as well as the training and testing errors, for γ = 0 and γ = 1. We can make the following simple observations:

Number of kernels   The number of kernels present in the sparse conic combination is a non-monotonic function of the regularization parameter.
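The weighting rule described above (count the p_j eigenvalues of the unit-trace K_j above 1/(2n), then set d_j = p_j^γ) can be sketched as follows; this is our illustration, the two 2×2 example kernels are made up, and a real implementation would use a general symmetric eigensolver rather than the closed 2×2 form:

```python
# Sketch of the weighting rule from the text: p_j counts the eigenvalues of
# the trace-normalized K_j that exceed 1/(2n), and d_j = p_j ** gamma.

def eig2(K):
    # eigenvalues of a symmetric 2x2 matrix, in closed form
    a, b, d = K[0][0], K[0][1], K[1][1]
    m, r = 0.5 * (a + d), (0.25 * (a - d) ** 2 + b * b) ** 0.5
    return [m - r, m + r]

def weight(K, gamma):
    n = len(K)
    tr = sum(K[i][i] for i in range(n))
    Kn = [[K[i][j] / tr for j in range(n)] for i in range(n)]   # unit trace
    p = sum(1 for ev in eig2(Kn) if ev > 1.0 / (2 * n))
    return p ** gamma

K_flat = [[1.0, 0.0], [0.0, 1.0]]      # flat spectrum (identity-like): p = 2
K_rank1 = [[1.0, 1.0], [1.0, 1.0]]     # rank one: p = 1

assert weight(K_flat, gamma=1) == 2    # heavier penalty for the flat spectrum
assert weight(K_rank1, gamma=1) == 1
assert weight(K_flat, gamma=0) == weight(K_rank1, gamma=0) == 1
```

With γ = 0 every kernel gets the same weight; increasing γ penalizes high-rank kernels more, exactly the effect compared in Figure 3.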
When the blocks are one-dimensional, a situation equivalent to variable selection with a 1-norm penalty, this number is usually a nearly monotonic function of the regularization parameter [4].

Local minima   Validation set performance may exhibit local minima, and thus algorithms based on hill-climbing might exhibit poor performance by being trapped in a local minimum, whereas our approach, which computes the entire path, avoids this.

Behavior for small λ   For all values of γ, as λ goes to zero, the number of kernels remains the same and the training error goes to zero, while the testing error remains constant. What changes when γ changes is the value of λ at which this behavior appears; in particular, for small values of γ, it happens before the testing error goes back up, leading to an unusual validation performance curve (a usual cross-validation curve would diverge to large values when the regularization parameter goes to zero). It is thus crucial to use weights d_j that grow with the "size" of the kernel, and not simply constant weights.

This behavior can be confirmed by a detailed analysis of the optimality conditions, which shows that if one of the kernels has a flat spectrum (such as the identity matrix), then, as λ goes to zero, α tends to a limit, while η tends to a limit for linear regression and goes to infinity as log(1/λ) for logistic regression. Also, once in that limiting regime, the training error goes to zero quickly, while the testing error remains constant.

5 Conclusion

We have presented an algorithm to compute entire regularization paths for the problem of multiple kernel learning. Empirical results using this algorithm have provided us with insight into the effect of regularization for such problems.
In particular, we showed that the behavior of block 1-norm regularization differs notably from that of traditional (non-block) 1-norm regularization.

As presented, the empirical results suggest that our algorithm scales quadratically in the number of kernels, but cubically in the number of data points. Indeed, the main computational burden (for both predictor and corrector steps) is the inversion of a Hessian. In order to make the computation of entire paths efficient for problems involving a large number of data points, we are currently investigating inverse Hessian updating, a technique which is commonly used in quasi-Newton methods [10].

Acknowledgments

We wish to acknowledge support from NSF grant 0412995, a grant from Intel Corporation, and a graduate fellowship to Francis Bach from Microsoft Research.

References

[1] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[2] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27-72, 2004.
[3] C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In NIPS 15, 2003.
[4] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Stat., 32(2):407-499, 2004.
[5] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. In NIPS 17, 2005.
[6] A. Corduneanu and T. Jaakkola. Continuation methods for mixing heterogeneous sources. In UAI, 2002.
[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.
[8] E. L. Allgower and K. Georg. Continuation and path following. Acta Numer., 2:1-64, 1993.
[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2003.
[10] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal.
Numerical Optimization: Theoretical and Practical Aspects. Springer, 2003.