{"title": "Optimal Stopping and Effective Machine Complexity in Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 303, "page_last": 310, "abstract": null, "full_text": "Optimal Stopping and Effective Machine \n\nComplexity in Learning \n\nChangfeng Wang \n\nDepartment of SystE'IIlS Sci. (Iud Ell/!,. \n\nUJliversity of PPIIIlsylv1I.Ili(l \n\nPhiladelphin, PA, U.S.A. I!HlJ4 \n\nSalltosh S. Venkatesh \n\nDp\u00bbartn}(,llt (If Elf'drical EugiJlPprinJ!, \n\nU IIi v('rsi ty (If Ppnllsyl va nia \n\nPhiladelphia, PA, U.S.A. 19104 \n\nJ. Stephen Judd \n\nSiemens Corporate Research \n\n755 College Rd. East, \n\nPrinceton, NJ, U.S.A. 08540 \n\nAbstract \n\nWe study tltt' problem of when to stop If'arning a class of feedforward networks \n- networks with linear outputs I1PUrOIl and fixed input weights - when they are \ntrained with a gradient descent algorithm on a finite number of examples. Under \ngeneral regularity conditions, it is shown that there a.re in general three distinct \nphases in the generalization performance in the learning process, and in particular, \nthe network has hetter gt'neralization pPTformance when learning is stopped at a \ncertain time before til(' global miniIl111lu of the empirical error is reachert. A notion \nof effective size of a machine is rtefil1e*-machines. \n\nWe consider the problem of learning from examples a relationship between a random variable Y \nand an n-dimensional random vector X. We assume that this function is given by (1) for some fixed \ninteger d, the random vector X and random variable ~ are defined on the same probability space, \nthat E [~IX) = 0, and (12(X) = Var{\u20aclX) = constant < 00 almost surely. The smallest eigenvalue of \nthe matrix 1fJ(x)1fJ(x ) is assumed to be bounded from below by the inverse of some square integrable \nfunction. \n\nNote that it can be shown that the VC-dirnension of the class of cI>-machines with d neurons \nis d under the last assumption. 
The learning-theoretic properties of the system will be determined largely by the eigenstructure of Φ. Accordingly, let λ_1 ≥ λ_2 ≥ ... ≥ λ_d denote the eigenvalues of Φ.\n\nThe goal of the learning is to find the true concept α* given independently drawn examples (X, Y) from (1). Given any hypothesis (vector) w = (w_1, ..., w_d)' for consideration as an approximation to the true concept α*, the performance measure we use is the mean-square prediction (or ensemble) error\n\nE(w) = E[(Y − ψ(X)'w)²].   (2)\n\nNote that the true concept α* is the mean-square solution\n\nα* = argmin_w E(w) = Φ⁻¹ E[ψ(X)Y],   (3)\n\nand the minimum prediction error is given by E(α*) = min_w E(w) = σ².\n\nLet n be the number of samples of (X, Y). We assume that an independent, identically distributed sample (X⁽¹⁾, Y⁽¹⁾), ..., (X⁽ⁿ⁾, Y⁽ⁿ⁾), generated according to the joint distribution of (X, Y) induced by (1), is provided to the learner. To simplify notation, define the matrix Ψ ≡ [ψ(X⁽¹⁾) ... ψ(X⁽ⁿ⁾)] and the corresponding vector of outputs Y = (Y⁽¹⁾, ..., Y⁽ⁿ⁾)'. In analogy with (2) define the empirical error on the sample by\n\nÊ_n(w) = (1/n) Σ_{k=1}^n (Y⁽ᵏ⁾ − ψ(X⁽ᵏ⁾)'w)².\n\nLet α̂ denote the hypothesis vector for which the empirical error on the sample is minimized: ∇_w Ê_n(α̂) = 0. Analogously with (3) we can then show that\n\nα̂ = Φ̂⁻¹ (1/n) Ψ Y,   (4)\n\nwhere Φ̂ = (1/n) Ψ Ψ' is the empirical covariance matrix, which is almost surely nonsingular for large n. The terms in (4) are the empirical counterparts of the ensemble averages in (3).\n\nThe gradient descent algorithm is given by\n\nα_{t+1} = α_t − ε ∇_w Ê_n(α_t),   (5)\n\nwhere α = (α_1, α_2, ..., α_d)' is the weight vector, t is the number of iterations, and ε is the rate of learning. From this we can get\n\nα_t = (I − Φ(t)) α̂ + Φ(t) α_0,   (6)\n\nwhere Φ(t) ≡ (I − ε Φ̂)^t and α_0 is the initial weight vector.\n\nThe limit of α_t as t goes to infinity is α̂, provided Φ̂ is positive definite and the learning rate ε is small enough (i.e., smaller than the inverse of the largest eigenvalue of Φ̂). This implies that the gradient descent algorithm converges to the least-squares solution, starting from any point in R^d.\n\n3 GENERALIZATION DYNAMICS AND STOPPING TIME\n\n3.1 MAIN THEOREM OF GENERALIZATION DYNAMICS\n\nEven if the true concept (i.e., the precise relation between Y and X in the current problem) is in the class of models we consider, it is usually hopeless to find it using only a finite number of examples, except in some trivial cases. Our goal is hence less ambitious; we seek the best approximation of the true concept, the approach entailing a minimization of the training or empirical error, and then taking the global minimum of the empirical error α̂ as the approximation. As we have seen, the procedure is unbiased and consistent. Does this then imply that training should always be carried out to the limit? Surprisingly, the answer is no. This assertion follows from the next theorem.\n\nTheorem 3.1 Let M_n > 0 be an arbitrary real constant (possibly depending on n), and suppose assumptions A1 to A3 are satisfied; then the generalization dynamics in the training process are governed by the following equation:\n\nE[E(α_t)] = E[E(α̂)] + Σ_{i=1}^d [λ_i δ_i² (1 − ελ_i)^{2t} − (2σ²/n)(1 − ελ_i)^t (1 − ½(1 − ελ_i)^t)] + O(n^{-3/2}),\n\nuniformly for all initial weight vectors α_0 in the d-dimensional ball {α* + δ : ||δ|| ≤ M_n, δ ∈ R^d}, and for all t > 0.  □\n\n3.2 THREE PHASES IN GENERALIZATION\n\nBy Theorem 3.1, the mean generalization error at each epoch of the training process is characterized by the following function:\n\nφ(t) ≡ Σ_{i=1}^d [λ_i δ_i² (1 − ελ_i)^{2t} − (2σ²/n)(1 − ελ_i)^t (1 − ½(1 − ελ_i)^t)].\n\nThe analysis of the evolution of generalization with training is facilitated by treating φ(·) as a function of a continuous time parameter t. We will show that
there are three distinct phases in the generalization dynamics. These results are given in the following form by several corollaries of Theorem 3.1.\n\nWithout loss of generality, we assume the initial weight vector is picked in a region with ||δ|| ≤ M_n = O(n⁰), and, in particular, that the initial error is of order O(n⁰). Let r_t = t ln(1/[1 − ελ_d]) / ln n; then for all 0 ≤ t < t_1 ≡ ln n / (2 ln(1/[1 − ελ_d])) we have 0 ≤ r_t < ½, and thus\n\nΣ_{i=1}^d λ_i δ_i² (1 − ελ_i)^{2t} = O(n^{-2r_t}) > O(n^{-1}) ≥ (2σ²/n) Σ_{i=1}^d (1 − ελ_i)^t.\n\nThe quantity δ_i²(1 − ελ_i)^{2t} in the first term of the above inequalities is related to the elimination of the initial error, and can be defined as the approximation error (or fitting error); the last term is related to the effective complexity of the network at t (in fact, an order O(1/n) shift of the complexity error). The definitions and observations here will be discussed in more detail in the next section.\n\nWe call the learning process during the time interval 0 ≤ t ≤ t_1 the first phase of learning. Since in this interval φ(t) = O(n^{-2r_t}) is a monotonically decreasing function of t, the generalization error decreases monotonically in the first phase of learning. At the end of the first phase of learning φ(t_1) = O(1/n); therefore the generalization error is E(α_{t_1}) = E(α_∞) + O(1/n). As a summary of these statements we have the following corollary.\n\nCorollary 3.2 In the first phase of learning, the complexity error is dominated by the approximation error, and, within an order of O(1/n), the generalization error decreases monotonically in the learning process to E(α_∞) + O(1/n) at the end of the first phase.  □\n\nFor t > t_1, we can show by Theorem 3.1 that the generalization dynamics is given by the
following equation, where δ_{t_1} ≡ α_{t_1} − α*:\n\nE[E(α_{t_1+t})] = E[E(α̂)] − (2σ²/n) Σ_{i=1}^d (1 − ελ_i)^t [1 − ½(1 + ρ_i²)(1 − ελ_i)^t] + O(n^{-2}),\n\nwhere ρ_i² ≡ λ_i δ_i²(t_1) n / σ², which is, with probability approaching one, of order O(n⁰).\n\nWithout causing confusion, we still use φ(·) for the new time-varying part of the generalization error. The function φ(·) has much more complex behavior after t_1 than in the first phase of learning. As we will see, it decreases for some time, and finally begins to increase again. In particular, we find the best generalization at the t where φ(t) is minimized. (Note that δ_{t_1} is a random variable now, and the following statements of the generalization dynamics hold in the sense of with probability approaching one as n → ∞.)\n\nDefine the optimal stopping time t_min ≡ argmin{E(α_t) : t ∈ [0, ∞)}, i.e., the epoch corresponding to the smallest generalization error. Then we can prove the following corollaries:\n\nCorollary 3.3 The optimal stopping time t_min = O(ln n), provided σ² > 0. In particular, the following inequalities hold:\n\n1. t_ℓ ≤ t_min ≤ t_u, where t_ℓ = t_1 + min_i ln(1 + ρ_i²) / (2 ln(1/[1 − ελ_i])) and t_u = t_1 + max_i ln(1 + ρ_i²) / (2 ln(1/[1 − ελ_i])) are both finite real numbers. That is, the smallest generalization error occurs before the global minimum of the empirical error is reached.\n\n2. φ(·) (tracking the generalization error) decreases monotonically for t < t_ℓ and increases monotonically to zero for t > t_u; furthermore, t_min is unique if t_ℓ + ln 2 / ln(1/[1 − ελ_1]) > t_u.\n\n3. −(σ²/n) Σ_{i=1}^d 1/(1 + ρ_i²) ≤ φ(t_min) ≤ −σ² (d/n) [ρ²/(1 + ρ²)] [1/(1 + ρ²)]^γ, where γ = ln(1 − ελ_1)/ln(1 − ελ_d) and ρ² = Σ_{i=1}^d ρ_i².\n\nIn accordance with our earlier definitions, we call the learning process during the time interval between t_1 and t_u the second phase of learning, and the rest of the time the third phase of learning.\n\nAccording to Corollary 3.3, for t > t_u sufficiently large, the generalization error is uniformly better than at the global minimum α̂ of the empirical error, although the minimum generalization error is achieved between t_ℓ and t_u. The generalization error is reduced by at least σ² (d/n) [ρ²/(1 + ρ²)] [1/(1 + ρ²)]^γ over that for α̂ if we stop training at a proper time. For a fixed number of learning examples, the larger the ratio d/n, the larger the improvement in generalization error if the algorithm is stopped before the global minimum α̂ is reached.\n\n4 THE EFFECTIVE SIZE OF THE MACHINE\n\nOur concentration on dynamics and our seeming disregard for complexity do not conflict with the learning-theoretic focus on VC-dimension; in fact, the two attitudes fit nicely together. This section explains the generalization dynamics by introducing the concept of effective complexity of the machine. It is argued that early stopping in effect sets the effective size of the network to a value smaller than its VC-dimension.\n\nThe effective size of the machine at time t is defined to be\n\nd(t) ≡ Σ_{i=1}^d [1 − (1 − ελ_i)^t]²,\n\nwhich increases monotonically to d, the VC-dimension of the network, as t → ∞.
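\n\nA quick numerical check of this definition is possible, since d(t) depends only on the learning rate ε and the eigenvalues of the empirical covariance matrix. The sketch below uses hypothetical eigenvalues (not taken from the paper) and verifies that d(t) starts at zero, increases strictly monotonically, and saturates at the VC-dimension d:

```python
eps = 0.1                                # hypothetical learning rate
lams = [4.0, 2.0, 1.0, 0.5, 0.1]         # hypothetical eigenvalues of Phi-hat
d = len(lams)

def d_eff(t):
    # effective size at epoch t: sum_i [1 - (1 - eps*lambda_i)^t]^2
    return sum((1 - (1 - eps * lam) ** t) ** 2 for lam in lams)

vals = [d_eff(t) for t in range(200)]
frac_used = d_eff(50) / d                # fraction of the capacity in use at t = 50
```

Any learning rate with ελ_1 < 1 gives the same qualitative picture; stopping at epoch t thus amounts to running a machine of effective size d(t) < d.\n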
This definition is justified by the following theorem:\n\nTheorem 4.1 Under the assumptions of Theorem 3.1, the following equation holds uniformly for all α_0 such that ||δ|| ≤ M_n:\n\nE[E(α_t)] = E(α*, t) + (σ²/n) d(t) + O(n^{-3/2}).   (7)\n\nIn the limit of learning, we have by letting t → ∞ in the above equation,\n\nE[E(α̂)] = E(α*) + (σ²/n) d + O(n^{-3/2}).   (8)  □\n\nHence, to an order of O(n^{-3/2}), the generalization error at the limit of training breaks into two parts: the approximation error E(α*), and the complexity error (σ²/n) d. Clearly, the latter is proportional to d, the VC-dimension of the network. For all d larger than necessary, E(α*) remains a constant, and the generalization error is determined solely by (σ²/n) d. The term E(α*, t) differs from E(α*) only in terms of initial error, and is identified to be the approximation error at t. Comparison of the above two equations thus shows that it is reasonable to define (σ²/n) d(t) as the complexity error at t, and justifies the definition of d(t) as the effective size of the machine at the same time. The quantity d(t) captures the notion of the degree to which the capacity of the machine is used at t. It depends on the machine parameters, the algorithm being used, and the marginal distribution of X. Thus, we see from (7) that the generalization error at epoch t falls into the same two parts as it does at the limit: the approximation error (fitting error) and the complexity error (determined by the effective size of the machine).\n\nAs we have shown in the last section, during the first phase of learning the complexity error is of higher order in n compared to the fitting error, if the initial error is of order O(n⁰) or larger. Thus a decrease of the fitting error (which is proportional to the training error, as we will see in the next section) implies a decrease of the generalization error. However, when the fitting error is brought down to the order O(1/n), a decrease of the fitting error will no longer imply a decrease of the generalization error. In fact, by the above theorem, the generalization error at t_1 + t can be written as\n\nE[E(α_{t_1+t})] = E(α*, t_1 + t) + (σ²/n) d(t_1 + t) + O(n^{-3/2}).\n\nThe fitting error and the complexity error compete at order O(1/n) during the second phase of learning. After the second phase of learning, the complexity error dominates the fitting error, still at the order of O(1/n). Furthermore, if we define K ≡ [ρ²/(1 + ρ²)] [1/(1 + ρ²)]^γ, then by the above equation and Corollary 3.3 we have\n\nCorollary 4.2 At the optimal stopping time, the following upper bound on the generalization error holds:\n\nE[E(α_{t_min})] ≤ E(α*, t_min) + (σ²/n)(1 − K) d + O(n^{-3/2}).  □\n\nSince K is a quantity of order O(n⁰), (1 − K)d is strictly smaller than d. Thus stopping training at t_min has the same effect as using a smaller machine of size less than (1 − K)d and carrying training out to the limit! A more detailed analysis reveals how the effective size of the machine is affected by each neuron in the learning process (omitted due to space limits).\n\nREMARK: The concept of effective size of the machine can be defined similarly for an arbitrary starting point. However, to compare the degree to which the capacity of the machine has been used at t, one must specify at what distance between the hypothesis α and the truth α* such comparison is started. While each point in the d-dimensional Euclidean space can be regarded as a hypothesis (machine) about α*, it is intuitively clear that each of these machines has a different capacity to approximate it. But it is reasonable to think that all of the machines that are on the same sphere {α : |α − α*| = r}, for each r > 0, have the same capacity in approximating α*.
Thus, to compare the capacity being used at t, we must specify a specific sphere as the starting point; defining the effective size of the machine at t without specifying the starting sphere is clearly meaningless. As we have seen, the sphere reached at the end of the first phase of learning is found to be a good choice for our purposes.\n\n5 NETWORK SIZE SELECTION\n\nThe next theorem relates the generalization error and training error at each epoch of learning, and forms the basis for choosing the optimal stopping time as well as the best size of the machine during the learning process. In the limit of the learning process, the criterion reduces to the well-known Akaike Information Criterion (AIC) for statistical model selection. Comparison of the two criteria reveals that our criterion will result in better generalization than the AIC, since it incorporates the information of each individual neuron rather than just the total number of neurons as in the AIC.\n\nTheorem 5.1 Assume the learning algorithm converges and the conditions of Theorem 3.1 are satisfied; then the following equation holds:\n\nE[E(α_t)] = (1 + o(1)) E[Ê_n(α_t)] + C(d, t) + o(1/n),   (9)\n\nwhere C(d, t) = (2σ²/n) Σ_{j=1}^d [1 − (1 − ελ_j)^t].  □\n\nAccording to this theorem, we find an asymptotically unbiased estimate of E(α_t) to be Ê_n(α_t) + C(d, t) when σ² is known. This results in the following criterion for finding the optimal stopping time and network size:\n\nmin{Ê_n(α_t) + C(d, t) : d, t = 1, 2, ...}   (10)\n\nWhen t goes to infinity, the above criterion becomes\n\nmin{Ê_n(α̂) + 2σ²d/n : d = 1, 2, ...},   (11)\n\nwhich is the AIC for choosing the best size of network. Therefore, (10) can be viewed as an extension of the AIC to the learning process.
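\n\nThe penalty term C(d, t) is simple to compute once σ², ε, and the eigenvalues λ_j of the empirical covariance matrix are available. The sketch below, with hypothetical values for all of these quantities, checks that the penalty vanishes at t = 0, grows monotonically with training time, and converges to the AIC penalty 2σ²d/n, so that (10) indeed reduces to (11) in the limit:

```python
eps, sigma2, n = 0.1, 1.0, 50      # hypothetical learning rate, noise variance, sample size
lams = [3.0, 1.5, 0.8, 0.2]        # hypothetical eigenvalues of Phi-hat
d = len(lams)

def C(t):
    # complexity penalty at epoch t: (2*sigma2/n) * sum_j [1 - (1 - eps*lambda_j)^t]
    return (2 * sigma2 / n) * sum(1 - (1 - eps * lam) ** t for lam in lams)

aic_penalty = 2 * sigma2 * d / n   # the AIC penalty, recovered as t -> infinity
```

In practice σ² is unknown and must be estimated, e.g., from the residual error of the converged machine; the criterion is then evaluated over a grid of (d, t) pairs.\n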
\nTo understand the differences, consider the case where ξ has normal distribution N(0, σ²). Under this assumption, Maximum Likelihood (ML) estimation of the weight vector is the same as mean-square estimation. The AIC was obtained by minimizing the Kullback-Leibler distance between the density function f_{α_ML}(X), with α_ML being the ML estimate of α, and the true density f_{α*}. This is equivalent to minimizing lim_{t→∞} E[(Y − f_{α_ML}(X))²] (assuming the limit and the expectation are interchangeable). Now it is clear that while the AIC chooses networks only at the limit of learning, (10) does so throughout the whole learning process. Observe that the matrix Φ is now exactly the Fisher Information Matrix of the density function f_α(X), and λ_i is a measure of the capacity of ψ_i in fitting the relation between X and Y. Therefore our criterion incorporates the information about each specific neuron provided by the Fisher Information Matrix, which is a measure of how well the data fit the model. This implies that there are two aspects in finding the trade-off between model complexity and empirical error in order to minimize the generalization error: one is to have the smallest number of neurons, and the other is to minimize the utilization of each neuron. The AIC (and in fact most statistical model selection criteria) is aimed at the former, while our criterion incorporates the two aspects at the same time. We have seen in the earlier discussions that for a given number of neurons, this is done by using the capacity of each neuron in fitting the data only to the degree 1 − (1 − ελ_i)^{t_min}, rather than to its limit.\n\n6 CONCLUDING REMARKS\n\nTo the best of our knowledge, the results described in this paper provide for the first time a precise language to describe overtraining phenomena in learning machines such as neural networks.
We have studied formally the generalization process of a linear machine when it is trained with a gradient descent algorithm. The concept of effective size of a machine was introduced to break the generalization error into two parts: the approximation error and the error caused by a complexity term which is proportional to the effective size; the former decreases monotonically and the latter increases monotonically in the learning process. When the machine is trained on a finite number of examples, there are in general three distinct phases of learning according to the relative magnitudes of the fitting and complexity errors. In particular, there exists an optimal stopping time t_min = O(ln n) for minimizing the generalization error, which occurs before the global minimum of the empirical error is reached. These results lead to a generalization of the AIC in which the effect of certain network parameters and the time of learning are together taken into account in the network size selection process.\n\nFor practical applications of neural networks, these results demonstrate that training a network to its limits is not desirable. From the learning-theoretic point of view, the concept of effective dimension of a network tells us that we need more than the VC-dimension of a machine to describe the generalization properties of a machine, except in the limit of learning.\n\nThe generalization of the AIC reveals a previously unremarked fact in statistical model selection theory: namely, the generalization error of a network is affected not only by the number of parameters but also by the degree to which each parameter is actually used in the learning process. Occam's principle therefore stands in a subtler form: make minimal use of the capacity of a network for encoding the information provided by learning samples.\n\nOur results hold for weaker assumptions than were made herein about the distributions of X and ξ.
The case of machines that have vector (rather than scalar) outputs is a simple generalization. Also, our theorems have recently been generalized to the case of general nonlinear machines and are not restricted to the squared-error loss function.\n\nWhile the problem of inferring a rule from observational data has long been studied in learning theory, as well as in other contexts such as linear and nonlinear regression, the study of the problem as a dynamical process seems to open a new avenue for looking at it. Many problems remain open. For example, it is interesting to know what could be learned from a finite number of examples in a finite number of iterations in the case where the size of the machine is not small compared to the sample size.\n\nAcknowledgments\n\nC. Wang thanks Siemens Corporate Research for support during the summer of 1992 when this research was initiated. The work of C. Wang and S. Venkatesh has been supported in part by the Air Force Office of Scientific Research under grant F49620-93-1-0120.\n\nReferences\n\n[1] Akaike, H. (1974) Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, ed. B. N. Krishnaiah, North Holland, Amsterdam, 27-41.\n\n[2] Baldi, P. and Y. Chauvin (1991) Temporal evolution of generalization during learning in linear networks. Neural Computation 3, 589-603.\n\n[3] Chow, G. C. (1981) A comparison of the information and posterior probability criteria for model selection. Journal of Econometrics 16, 21-34.\n\n[4] Hansen, Lars Kai (1993) Stochastic linear learning: exact test and training error averages. Neural Networks 4, 393-396.\n\n[5] Haussler, D. (1989) Decision theoretic generalization of the PAC model for neural networks and other learning applications. Preprint.\n\n[6] Heskes, Tom M. and Bert Kappen (1991) Learning processes in neural networks. Physical Review A 44(4), 2718-2726.\n\n[7] Krogh, Anders and John A. Hertz. Generalization in a linear perceptron in the presence of noise. Preprint.\n\n[8] Nilsson, N. J. Learning Machines. New York: McGraw-Hill.\n\n[9] Pinelis, I. and S. Utev (1984) Estimates of moments of sums of independent random variables. Theory of Probability and Its Applications 29, 574-577.\n\n[10] Rissanen, J. (1987) Stochastic complexity. Journal of the Royal Statistical Society, Series B 49(3), 223-265.\n\n[11] Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics 6, 461-464.\n\n[12] Sazonov, V. (1982) On the accuracy of normal approximation. Journal of Multivariate Analysis 12, 371-384.\n\n[13] Senatov, V. (1980) Uniform estimates of the rate of convergence in the multi-dimensional central limit theorem. Theory of Probability and Its Applications 25, 745-758.\n\n[14] Vapnik, V. (1992) Measuring the capacity of learning machines (I). Preprint.\n\n[15] Weigend, A. S. and D. Rumelhart (1991) Generalization through minimal networks with application to forecasting. INTERFACE'91: 23rd Symposium on the Interface: Computing Science and Statistics, ed. E. M. Keramidas, 362-370. Interface Foundation of North America.\n", "award": [], "sourceid": 816, "authors": [{"given_name": "Changfeng", "family_name": "Wang", "institution": null}, {"given_name": "Santosh", "family_name": "Venkatesh", "institution": null}, {"given_name": "J.", "family_name": "Judd", "institution": null}]}