{"title": "A New Approximate Maximal Margin Classification Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 500, "page_last": 506, "abstract": null, "full_text": "A New Approximate Maximal Margin \n\nClassification Algorithm \n\nClaudio Gentile \n\nDSI, Universita' di Milano, \n\nVia Comelico 39, \n20135 Milano, Italy \n\ngentile@dsi.unimi.it \n\nAbstract \n\nA new incremental learning algorithm is described which approximates \nthe maximal margin hyperplane w.r.t. norm p ~ 2 for a set of linearly \nseparable data. Our algorithm, called ALMAp (Approximate Large Mar-\ngin algorithm w.r.t. norm p), takes 0 ((P~21;;2) corrections to sepa(cid:173)\nrate the data with p-norm margin larger than (1 - 0:) ,,(, where,,( is the \np-norm margin of the data and X is a bound on the p-norm of the in(cid:173)\nstances. ALMAp avoids quadratic (or higher-order) programming meth(cid:173)\nods. It is very easy to implement and is as fast as on-line algorithms, such \nas Rosenblatt's perceptron. We report on some experiments comparing \nALMAp to two incremental algorithms: Perceptron and Li and Long's \nROMMA. Our algorithm seems to perform quite better than both. The \naccuracy levels achieved by ALMAp are slightly inferior to those obtained \nby Support vector Machines (SVMs). On the other hand, ALMAp is quite \nfaster and easier to implement than standard SVMs training algorithms. \n\n1 Introduction \n\nA great deal of effort has been devoted in recent years to the study of maximal margin \nclassifiers. This interest is largely due to their remarkable generalization ability. In this pa(cid:173)\nper we focus on special maximal margin classifiers, i.e., on maximal margin hyperplanes. \nBriefly, given a set linearly separable data, the maximal margin hyperplane classifies all the \ndata correctly and maximizes the minimal distance between the data and the hyperplane. 
If the Euclidean norm is used to measure the distance, then computing the maximal margin hyperplane corresponds to the, by now classical, Support Vector Machines (SVMs) training problem [3]. This task is naturally formulated as a quadratic programming problem. If an arbitrary norm p is used, then this task turns into a more general mathematical programming problem (see, e.g., [15, 16]) to be solved by general-purpose (and computationally intensive) optimization methods. This more general task arises in feature selection problems when the target to be learned is sparse.

A major theme of this paper is to devise simple and efficient algorithms to solve the maximal margin hyperplane problem. The paper has two main contributions. The first contribution is a new efficient algorithm which approximates the maximal margin hyperplane w.r.t. norm p to any given accuracy. We call this algorithm ALMA_p (Approximate Large Margin algorithm w.r.t. norm p). ALMA_p is naturally viewed as an on-line algorithm, i.e., as an algorithm which processes the examples one at a time. A distinguishing feature of ALMA_p is that its relevant parameters (such as the learning rate) are dynamically adjusted over time. In this sense, ALMA_p is a refinement of the on-line algorithms recently introduced in [2]. Moreover, ALMA_2 (i.e., ALMA_p with p = 2) is a Perceptron-like algorithm: the operations it performs can be expressed as dot products, so that we can replace them by kernel function evaluations. ALMA_2 approximately solves the SVM training problem, avoiding quadratic programming. As far as theoretical performance is concerned, ALMA_2 achieves essentially the same bound on the number of corrections as the one obtained by a version of Li and Long's ROMMA algorithm [12], though the two algorithms are different.(1) 
In the case where p is logarithmic in the dimension of the instance space (as in [6]), ALMA_p yields results similar to those obtained by estimators based on linear programming (see [1, Chapter 14]).

The second contribution of this paper is an experimental investigation of ALMA_2 on the problem of handwritten digit recognition. For the sake of comparison, we followed the experimental setting described in [3, 4, 12]. We ran ALMA_2 with polynomial kernels, using both the last and the voted hypotheses (as in [4]), and we compared our results to those described in [3, 4, 12]. We found that voted ALMA_2 generalizes considerably better than both ROMMA and the voted Perceptron algorithm, but slightly worse than standard SVMs. On the other hand, ALMA_2 is much faster and easier to implement than standard SVM training algorithms.

For related work on SVMs (with p = 2), see Friess et al. [5], Platt [17] and references therein.

The next section defines our major notation and recalls some basic preliminaries. In Section 3 we describe ALMA_p and state its theoretical properties. Section 4 describes our experimental comparison. Concluding remarks are given in the last section.

2 Preliminaries and notation

An example is a pair (x, y), where x is an instance belonging to R^n and y in {-1, +1} is the binary label associated with x. A weight vector w = (w_1, ..., w_n) in R^n represents an n-dimensional hyperplane passing through the origin. We associate with w a linear threshold classifier with threshold zero: w: x -> sign(w . x), i.e., the output is 1 if w . x >= 0 and -1 otherwise. For p >= 1 we denote by ||w||_p the p-norm of w, i.e., ||w||_p = (sum_{i=1}^n |w_i|^p)^{1/p} (also, ||w||_inf = lim_{p -> inf} (sum_{i=1}^n |w_i|^p)^{1/p} = max_i |w_i|). We say that q is dual to p if 1/p + 1/q = 1 holds. For instance, the 1-norm is dual to the inf-norm and the 2-norm is self-dual. In this paper we assume that p and q are some pair of dual values, with p >= 2. 
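For concreteness, the norms and the duality relation above can be checked numerically. The following is a minimal sketch (function and variable names are ours, not part of the paper):

```python
import numpy as np

def p_norm(w, p):
    """p-norm ||w||_p = (sum_i |w_i|^p)^(1/p)."""
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

def dual_exponent(p):
    """Return q with 1/p + 1/q = 1 (q dual to p)."""
    return p / (p - 1.0)

w = np.array([0.6, -0.8, 0.0])   # weight vector, measured in the q-norm
x = np.array([1.0, 2.0, -1.0])   # instance, measured in the p-norm
p = 2.5
q = dual_exponent(p)

# Hoelder's inequality: |w . x| <= ||w||_q * ||x||_p
assert abs(w @ x) <= p_norm(w, q) * p_norm(x, p) + 1e-12
```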
We use p-norms for instances and q-norms for weight vectors. The (normalized) p-norm margin (or just the margin, if p is clear from the surrounding context) of a hyperplane w with ||w||_q <= 1 on example (x, y) is defined as y w . x / ||x||_p. If this margin is positive,(2) then w classifies (x, y) correctly. Notice that from Hoelder's inequality we have y w . x <= |w . x| <= ||w||_q ||x||_p <= ||x||_p. Hence y w . x / ||x||_p in [-1, 1].

Our goal is to approximate the maximal p-norm margin hyperplane for a set of examples (the training set). For this purpose, we use terminology and analytical tools from the on-line learning literature.

(1) In fact, algorithms such as ROMMA and the one contained in Kowalczyk [10] have been specifically designed for the Euclidean norm. Any straightforward extension of these algorithms to a general norm p seems to require numerical methods.
(2) We assume that w . x = 0 yields a wrong classification, independent of y.

We focus on an on-line learning model introduced by Littlestone [14]. An on-line learning algorithm processes the examples one at a time in trials. In each trial, the algorithm observes an instance x and is required to predict the label y associated with x. We denote the prediction by y^. The prediction y^ combines the current instance x with the current internal state of the algorithm. In our case this state is essentially a weight vector w, representing the algorithm's current hypothesis about the maximal margin hyperplane. After the prediction is made, the true value of y is revealed and the algorithm suffers a loss, measuring the \"distance\" between the prediction y^ and the label y. Then the algorithm updates its internal state.

In this paper the prediction y^ can be seen as the linear function y^ = w . x, and the loss is a margin-based 0-1 loss: the loss of w on example (x, y) is 1 if y w . x / ||x||_p <= (1 - alpha) gamma and 0 otherwise, for suitably chosen alpha, gamma in [0, 1]. 
Therefore, if ||w||_q <= 1 then the algorithm incurs positive loss if and only if w classifies (x, y) with (p-norm) margin not larger than (1 - alpha) gamma. On-line algorithms are typically loss driven, i.e., they update their internal state only in those trials where they suffer a positive loss. We call a trial where this occurs a correction. In the special case alpha = 1 a correction is a mistaken trial, and a loss driven algorithm turns into a mistake driven [14] algorithm. Throughout the paper we use the subscript t for x and y to denote the instance and the label processed in trial t. We use the subscript k for those variables, such as the algorithm's weight vector w, which are updated only within a correction. In particular, w_k denotes the algorithm's weight vector after k-1 corrections (so that w_1 is the initial weight vector). The goal of the on-line algorithm is to bound the cumulative loss (i.e., the total number of corrections or mistakes) it suffers on an arbitrary sequence of examples S = (x_1, y_1), ..., (x_T, y_T). If S is linearly separable with margin gamma and we pick alpha < 1, then a bounded loss clearly implies convergence in a finite number of steps to (an approximation of) the maximal margin hyperplane for S.

3 The approximate large margin algorithm ALMA_p

ALMA_p is a large margin variant of the p-norm Perceptron algorithm(3) [8, 6], and is similar in spirit to the variable learning rate algorithms introduced in [2]. We analyze ALMA_p by giving upper bounds on the number of corrections.

The main theoretical result of this paper is Theorem 1 below. This theorem has two parts. Part 1 bounds the number of corrections in the linearly separable case only. In the special case p = 2 this bound is very similar to the one proven by Li and Long for a version of ROMMA [12]. Part 2 holds for an arbitrary sequence of examples. 
A bound which is very close to the one proven in [8, 6] for the (constant learning rate) p-norm Perceptron algorithm is obtained as a special case. Despite this theoretical similarity, the experiments we report in Section 4 show that using our margin-sensitive variable learning rate algorithm yields a clear increase in performance.

In order to define our algorithm, we need to recall the following mapping f from [6] (a p-indexing for f is understood): f: R^n -> R^n, f = (f_1, ..., f_n), where

f_i(w) = sign(w_i) |w_i|^{q-1} / ||w||_q^{q-2},   w = (w_1, ..., w_n) in R^n.

Observe that p = 2 yields the identity function. The (unique) inverse f^{-1} of f is [6] f^{-1}: R^n -> R^n, f^{-1} = (f_1^{-1}, ..., f_n^{-1}), where

f_i^{-1}(theta) = sign(theta_i) |theta_i|^{p-1} / ||theta||_p^{p-2},   theta = (theta_1, ..., theta_n) in R^n,

namely, f^{-1} is obtained from f by replacing q with p.

(3) The p-norm Perceptron algorithm is a generalization of the classical Perceptron algorithm [18]: the p-norm Perceptron is actually the Perceptron when p = 2.

Algorithm ALMA_p(alpha; B, C), with alpha in (0, 1], B, C > 0.
Initialization: initial weight vector w_1 = 0; k = 1.
For t = 1, ..., T do:
  Get example (x_t, y_t) in R^n x {-1, +1} and update weights as follows:
    Set gamma_k = B sqrt(p-1) / sqrt(k);
    If y_t w_k . x_t / ||x_t||_p <= (1 - alpha) gamma_k then:
      eta_k = C / (sqrt(p-1) ||x_t||_p sqrt(k)),
      w'_k = f^{-1}(f(w_k) + eta_k y_t x_t),
      w_{k+1} = w'_k / max{1, ||w'_k||_q},
      k <- k + 1.

Figure 1: The approximate large margin algorithm ALMA_p.

ALMA_p is described in Figure 1. The algorithm is parameterized by alpha in (0, 1], B > 0 and C > 0. The parameter alpha measures the degree of approximation to the optimal margin hyperplane, while B and C might be considered as tuning parameters. Their use will be made clear in Theorem 1 below. Let W = {w in R^n : ||w||_q <= 1}. ALMA_p maintains a vector w_k of n weights in W. It starts from w_1 = 0. At time t the algorithm processes the example (x_t, y_t). 
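As an illustration, the pseudocode in Figure 1 translates into a short NumPy sketch. Function and variable names are ours, and only the plain (non-kernel) version is shown; the default B and C follow the choice made in Theorem 1, part 1:

```python
import numpy as np

def f_map(w, q):
    """The mapping f with exponent q (p = 2, i.e. q = 2, gives the identity)."""
    nq = np.sum(np.abs(w) ** q) ** (1.0 / q)
    if nq == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (q - 1) / nq ** (q - 2)

def alma_p(X, y, p=2.0, alpha=0.9, B=None, C=np.sqrt(2.0), epochs=1):
    """Run ALMA_p over the rows of X with labels y; return the final weight vector."""
    q = p / (p - 1.0)                        # dual exponent: 1/p + 1/q = 1
    if B is None:
        B = np.sqrt(8.0) / alpha             # Theorem 1, part 1 choice
    w = np.zeros(X.shape[1])
    k = 1
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            xnorm = np.sum(np.abs(x_t) ** p) ** (1.0 / p)
            if xnorm == 0.0:
                continue                     # degenerate case x_t = 0: no update
            gamma_k = B * np.sqrt((p - 1.0) / k)
            if y_t * (w @ x_t) / xnorm <= (1.0 - alpha) * gamma_k:
                eta_k = C / (np.sqrt(p - 1.0) * xnorm * np.sqrt(k))
                # first step: perceptron-like update in the dual coordinates;
                # f^{-1} is f with q replaced by p
                w_prime = f_map(f_map(w, q) + eta_k * y_t * x_t, p)
                # second step: projection onto the q-norm unit ball W
                wq = np.sum(np.abs(w_prime) ** q) ** (1.0 / q)
                w = w_prime / max(1.0, wq)
                k += 1
    return w
```

On a linearly separable sample, cycling a few times over the data drives the corrections to a halt and returns a vector in W that separates the sample.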
If the current weight vector w_k classifies (x_t, y_t) with margin not larger than (1 - alpha) gamma_k then a correction occurs. The update rule(4) has two main steps. The first step gives w'_k through the classical update of a (p-norm) Perceptron-like algorithm (notice, however, that the learning rate eta_k scales with k, the number of corrections occurred so far). The second step gives w_{k+1} by projecting w'_k onto W: w_{k+1} = w'_k / ||w'_k||_q if ||w'_k||_q > 1 and w_{k+1} = w'_k otherwise. The projection step makes the new weight vector w_{k+1} belong to W.

The following theorem, whose proof is omitted due to space limitations, has two parts. In part 1 we treat the separable case. Here we claim that a special choice of the parameters B and C gives rise to an algorithm which approximates the maximal margin hyperplane to any given accuracy alpha. In part 2 we claim that if a suitable relationship between the parameters B and C is satisfied then a bound on the number of corrections can be proven in the general (nonseparable) case. The bound of part 2 is in terms of the margin-based quantity V_gamma(u; (x, y)) = max{0, gamma - y u . x / ||x||_p}, gamma > 0. (Here a p-indexing for V_gamma is understood.) V_gamma is called deviation in [4] and linear hinge loss in [7].

Notice that B and C in part 1 do not meet the requirements given in part 2. On the other hand, in the separable case the B and C chosen in part 2 do not yield a hyperplane which is arbitrarily (up to a small alpha) close to the maximal margin hyperplane.

Theorem 1 Let W = {w in R^n : ||w||_q <= 1}, S = ((x_1, y_1), ..., (x_T, y_T)) in (R^n x {-1, +1})^T, and M be the set of corrections of ALMA_p(alpha; B, C) running on S (i.e., the set of trials t such that y_t w_k . x_t / ||x_t||_p <= (1 - alpha) gamma_k).

1. Let gamma* = max_{w in W} min_{t=1,...,T} y_t w . x_t / ||x_t||_p > 0. Then ALMA_p(alpha; sqrt(8)/alpha, sqrt(2)) achieves the following bound(5) on |M|:

|M| <= (2 (p - 1) / (gamma*)^2) (2/alpha - 1)^2 + 8/alpha - 4 = O( (p - 1) / (alpha^2 (gamma*)^2) ). 
(1)

(4) In the degenerate case x_t = 0 no update takes place.
(5) We did not optimize the constants here.

Furthermore, throughout the run of ALMA_p(alpha; sqrt(8)/alpha, sqrt(2)) we have gamma_k >= gamma*. Hence (1) is also an upper bound on the number of trials t such that y_t w_k . x_t / ||x_t||_p <= (1 - alpha) gamma*.

2. Let the parameters B and C in Figure 1 satisfy the equation(6)

C^2 + 2 (1 - alpha) B C = 1.

Then for any u in W, ALMA_p(alpha; B, C) achieves a bound on |M|, holding for any gamma > 0, expressed in terms of the cumulative deviation sum_{t in M} V_gamma(u; (x_t, y_t)) and of a constant rho depending on the algorithm's parameters.

Observe that when alpha = 1 this bound turns into a bound on the number of mistaken trials. In such a case the value of gamma_k (in particular, the value of B) is immaterial, while C is forced to be 1.

When p = 2 the computations performed by ALMA_p essentially involve only dot products (recall that p = 2 yields q = 2 and f = f^{-1} = identity). Thus the generalization of ALMA_2 to the kernel case is quite standard. In fact, the linear combination w_{k+1} . x can be computed recursively, since w_{k+1} . x = (w_k . x + eta_k y_t x_t . x) / N_{k+1}. Here the denominator N_{k+1} equals max{1, ||w'_k||_2}, and the norm ||w'_k||_2 is again computed recursively by ||w'_k||_2^2 = ||w'_{k-1}||_2^2 / N_k^2 + 2 eta_k y_t w_k . x_t + eta_k^2 ||x_t||_2^2, where the dot product w_k . x_t is taken from the k-th correction (the trial where the k-th weight update did occur).

4 Experimental results

We ran some experiments with ALMA_2 on the well-known MNIST OCR database.(7) Each example in this database consists of a 28x28 matrix representing a digitized image of a handwritten digit, along with a {0, 1, ..., 9}-valued label. Each entry in this matrix is a value in {0, 1, ..., 255}, representing a grey level. The database has 60000 training examples and 10000 test examples. The best accuracy results for this dataset are those obtained by LeCun et al. [11] through boosting on top of the neural net LeNet4. They reported a test error rate of 0.7%. A soft margin SVM achieved an error rate of 1.1% [3]. 
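The kernel recursion above can be sketched as follows. This is a minimal illustration with names of our own choosing: each stored tuple holds the quantities saved at a correction, namely the correcting example (x_k, y_k), the learning rate eta_k, and the denominator N_{k+1}:

```python
import numpy as np

def kernel_alma2_predict(corrections, K, x):
    """Compute w_{k+1} . x through the recursion
    w_{k+1} . x = (w_k . x + eta_k * y_k * K(x_k, x)) / N_{k+1},
    starting from w_1 = 0, so that w is never represented explicitly."""
    dot = 0.0
    for x_k, y_k, eta_k, N_next in corrections:
        dot = (dot + eta_k * y_k * K(x_k, x)) / N_next
    return dot
```

With a linear kernel K(a, b) = a . b this reproduces exactly the explicit computation w . x, which is what makes the kernelized version interchangeable with the plain one.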
In our experiments we used ALMA_2(alpha; 1/alpha, sqrt(2)) with different values of alpha. In the following, ALMA_2(alpha) is shorthand for ALMA_2(alpha; 1/alpha, sqrt(2)). We compared to SVMs, the Perceptron algorithm and the Perceptron-like algorithm ROMMA [12]. We followed closely the experimental setting described in [3, 4, 12]. We used a polynomial kernel K of the form K(x, y) = (1 + x . y)^d, with d = 4. (This choice was best in [4] and was also made in [3, 12].) However, we did not investigate any careful tuning of scaling factors. In particular, we did not determine the best instance scaling factor s for our algorithm (this corresponds to using the kernel K(x, y) = (1 + x . y / s)^d). In our experiments we set s = 255. This was actually the best choice in [12] for the Perceptron algorithm. We reduced the 10-class problem to 10 binary problems. Classification is made according to the maximum output of the 10 binary classifiers. The results are summarized in Table 1. As in [4], the output of a binary classifier is based on either the last hypothesis produced by the algorithms (denoted by \"last\" in Table 1) or Helmbold and Warmuth's [9] leave-one-out voted hypothesis (denoted by \"voted\"). We refer the reader to [4] for details. We trained the algorithms by cycling up to 3 times (\"epochs\") over the training set. All the results shown in Table 1 are averaged over 10 random permutations of the training sequence. The columns marked \"Corr's\" give the total number of corrections made in the training phase for the 10 labels. The first three rows of Table 1 are taken from [4, 12, 13]. The first two rows refer to the Perceptron algorithm,(8) while the third one refers to the best(9) noise-controlled (NC) version of ROMMA, called \"aggressive ROMMA\".

(6) Notice that B and C in part 1 do not satisfy this relationship.
(7) Available on Y. LeCun's home page: http://www.research.att.com/~yann/ocr/mnist/. 
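The reduction from the 10-class problem to 10 binary problems can be illustrated as follows (a generic sketch with names of our own choosing; `binary_scores` stands for the ten trained binary classifiers' real-valued outputs, e.g. the ten ALMA_2 margins):

```python
import numpy as np

def predict_digit(binary_scores, x):
    """One-vs-rest prediction: binary_scores is a list of ten functions,
    one per digit, each mapping an instance to a real-valued output;
    the predicted label is the index of the maximum output."""
    return int(np.argmax([score(x) for score in binary_scores]))
```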
Our own experimental results are given in the last six rows.

Among these Perceptron-like algorithms, ALMA_2 \"voted\" seems to be the most accurate. The standard deviations about our averages are reasonably small: those concerning test errors range in (0.03%, 0.09%). These results also show how accuracy and running time (as well as sparsity) can be traded off against each other in a transparent way. The accuracy of our algorithm is slightly worse than SVMs'. On the other hand, our algorithm is considerably faster and easier to implement than previous implementations of SVMs, such as those given in [17, 5]. An interesting feature of ALMA_2 is that its approximate solution relies on fewer support vectors than the SVM solution.

We found the accuracy of 1.77% for ALMA_2(1.0) fairly remarkable, considering that it has been obtained by sweeping through the examples just once for each of the ten classes. In fact, the algorithm is rather fast: training the ten binary classifiers of ALMA_2(1.0) for one epoch takes on average 2.3 hours, and the corresponding testing time is on average about 40 minutes. (All our experiments have been performed on a PC with a single Pentium III MMX processor running at 447 MHz.)

5 Concluding Remarks

In the full paper we will give more extensive experimental results for ALMA_2 and ALMA_p with p > 2. One drawback of ALMA_p's approximate solution is the absence of a bias term (i.e., a nonzero threshold). This seems to make little difference for the MNIST dataset, but there are cases when a biased maximal margin hyperplane generalizes considerably better than an unbiased one. It is not clear to us how to incorporate the SVMs' bias term into our algorithm. We leave this as an open problem.

Table 1: Experimental results on MNIST database. 
\"TestErr\" denotes the fraction of mis(cid:173)\nclassified patterns in the test set, while \"Corr's\" gives the total number of training correc(cid:173)\ntions for the 10 labels. Recall that voting takes place during the testing phase. Thus the \nnumber of corrections of \"last\" is the same as the number of corrections of \"voted\". \n\nPerceptron \n\n\"last\" \n\"voted\" \n\nagg-ROMMA(NC) (\"last\") \nALMA2(1.0) \n\nALMA2(0.9) \n\nALMA2(0.8) \n\n\"last\" \n\"voted\" \n\"last\" \n\"voted\" \n\"last\" \n\"voted\" \n\n1 Epoch \n\nTestErr Corr's \n7901 \n2.71% \n2.23% \n7901 \n2.05% 30088 \n2.52% \n7454 \n7454 \n1.77% \n2.10% \n9911 \n9911 \n1.69% \n1.98% \n12810 \n12810 \n1.68% \n\n2 Epochs \n\nTestErr Corr's \n10421 \n2.14% \n1.86% \n10421 \n1.76% 44495 \n2.01% \n9658 \n1.52% \n9658 \n1.74% \n12711 \n12711 \n1.49% \n1.72% \n16464 \n16464 \n1.44% \n\n3 Epochs \n\nTestErr Corr's \n11787 \n2.03% \n1.76% \n11787 \n1.67% 58583 \n10934 \n1.86% \n10934 \n1.47% \n14244 \n1.64% \n14244 \n1.40% \n1.60% \n18528 \n18528 \n1.35% \n\n8These results have been obtained with no noise control. It is not clear to us how to incorporate any \nnoise control mechanism into the classical Perceptron algorithm. The method employed in [10, 12] \ndoes not seem helpful in this case, at least for the first epoch. \n\n9 According to [12], ROMMA's last hypothesis seems to perform better than ROMMA's voted \n\nhypothesis. \n\n\fAcknowledgments \n\nThanks to Nicolo Cesa-Bianchi, Nigel Duffy, Dave Helmbold, Adam Kowalczyk, Yi Li, \nNick Littlestone and Dale Schuurmans for valuable conversations and email exchange. We \nwould also like to thank the NIPS2000 anonymous reviewers for their useful comments and \nsuggestions. The author is supported by a post-doctoral fellowship from Universita degli \nStudi di Milano. \n\nReferences \n[1] M. Anthony, P. Bartlett, Neural Network Learning: Theoretical Foundations, CMU, 1999. \n[2] \n\nP. Auer and C. 
Gentile. Adaptive and self-confident on-line learning algorithms. In 13th COLT, 107-117, 2000.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[4] Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.
[5] T.-T. Friess, N. Cristianini, and C. Campbell. The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. In 15th ICML, 1998.
[6] C. Gentile and N. Littlestone. The robustness of the p-norm algorithms. In 12th COLT, 1-11, 1999.
[7] C. Gentile and M. K. Warmuth. Linear hinge loss and average margin. In 11th NIPS, 225-231, 1999.
[8] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. In 10th COLT, 171-183, 1997.
[9] D. Helmbold and M. K. Warmuth. On weak learning. JCSS, 50(3):551-573, 1995.
[10] A. Kowalczyk. Maximal margin perceptron. In Smola, Bartlett, Scholkopf, and Schuurmans, editors, Advances in Large Margin Classifiers, MIT Press, 1999.
[11] Y. Le Cun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition. In ICANN, 53-60, 1995.
[12] Y. Li and P. Long. The relaxed online maximum margin algorithm. In 12th NIPS, 498-504, 2000.
[13] Y. Li. From Support Vector Machines to Large Margin Classifiers. PhD Thesis, School of Computing, the National University of Singapore, 2000.
[14] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.
[15] O. Mangasarian. Mathematical programming in data mining. Data Mining and Knowledge Discovery, 1(2):183-201, 1997.
[16] P. Nachbar, J. A. 
Nossek, and J. Strobl. The generalized adatron algorithm. In Proc. 1993 IEEE ISCAS, 2152-2155, 1993.
[17] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In Scholkopf, Burges, and Smola, editors, Advances in Kernel Methods: Support Vector Learning, MIT Press, 1998.
[18] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, D.C., 1962.
", "award": [], "sourceid": 1864, "authors": [{"given_name": "Claudio", "family_name": "Gentile", "institution": null}]}