{"title": "Neural Control for Rolling Mills: Incorporating Domain Theories to Overcome Data Deficiency", "book": "Advances in Neural Information Processing Systems", "page_first": 659, "page_last": 666, "abstract": null, "full_text": "Neural Control for Rolling Mills: Incorporating \nDomain Theories to Overcome Data Deficiency \n\nMartin Roscheisen \nComputer Science Dept. \n\nMunich Technical University \n\n8 Munich 40, FRG \n\nVolker Tresp \n\nCorporate R&D \n\nSiemens AG \n\n8 Munich 83, FRG \n\nReimar Hofmann \n\nComputer Science Dept. \n\nEdinburgh University \nEdinburgh, EH89A, UK \n\nAbstract \n\nIn a Bayesian framework, we give a principled account of how domain(cid:173)\nspecific prior knowledge such as imperfect analytic domain theories can be \noptimally incorporated into networks of locally-tuned units: by choosing \na specific architecture and by applying a specific training regimen. Our \nmethod proved successful in overcoming the data deficiency problem in \na large-scale application to devise a neural control for a hot line rolling \nmill. It achieves in this application significantly higher accuracy than \noptimally-tuned standard algorithms such as sigmoidal backpropagation, \nand outperforms the state-of-the-art solution. \n\n1 \n\nINTRODUCTION \n\nLearning in connectionist networks typically requires many training examples and \nrelies more or less explicitly on some kind of syntactic preference bias such as \"mini(cid:173)\nmal architecture\" (Rumelhart, 1988; Le Cun et ai., 1990; Weigend, 1991; inter alia) \nor a smoothness constraint operator (Poggio et ai., 1990), but does not make use of \nexplicit representations of domain-specific prior knowledge. If training data is defi(cid:173)\ncient, learning a functional mapping inductively may no longer be feasible, whereas \nthis may still be the case when guided by domain knowledge. 
Controlling a rolling mill is an example of a large-scale real-world application where training data is very scarce and noisy, yet there exist much-refined, though still very approximate, analytic models that have been applied for the past decades and embody many years of experience in this particular domain. Much in the spirit of Explanation-Based Learning (see, for example, Mitchell et al., 1986; Minton et al., 1989), where domain knowledge is applied to get valid generalizations from only a few training examples, we consider an analytic model as an imperfect domain theory from which the training data is \"explained\" (see also Scott et al., 1991; Bergadano et al., 1990; Tecuci et al., 1990). Using a Bayesian framework, we consider in Section 2 the optimal response of networks in the presence of noise on their input, and derive, in Section 2.1, a familiar localized network architecture (Moody et al., 1989, 1990). In Section 2.2, we show how domain knowledge can be readily incorporated into this localized network by applying a specific training regimen. These results were applied as part of a project to devise a neural control for a hot line rolling mill, and, in Section 3, we describe experimental results which indicate that incorporating domain theories can be indispensable for connectionist networks to be successful in difficult engineering domains. (See also the references for one of our more detailed papers.) \n\n2 THEORETICAL FOUNDATION \n\n2.1 NETWORK ARCHITECTURE \n\nWe apply a Bayesian framework to systems where the training data is assumed to be generated from the true model f, which itself is considered to be derived from a domain theory b that is represented as a function. 
Since the measurements in our application are very noisy and clustered, we took this as the paradigm case, and assume the actual input x in R^d to be a noisy version of one of a small number (N) of prototypical input vectors t_1, ..., t_N in R^d, where the noise is additive with covariance matrix Sigma. The corresponding true output values f(t_1), ..., f(t_N) in R are assumed to be distributed around the values suggested by the domain theory, b(t_1), ..., b(t_N) (variance sigma_prior^2). Thus, each point in the training data D := {(x_i, y_i); i = 1, ..., M} is considered to be generated as follows: x_i is obtained by selecting one of the t_k and adding zero-mean noise with covariance Sigma, and y_i is generated by adding Gaussian zero-mean noise with variance sigma_data^2 to f(t_k).[1] We determine the system's response O(x) to an input x to be optimal with respect to the expectation of the squared error (MMSE estimate): \n\nO(x) := argmin_{o(x)} E[(f(T_true) - o(x))^2]. \n\nThe expectation is given by sum_{i=1}^N P(T_true = t_i | X = x) . (f(t_i) - o(x))^2. Bayes' Theorem states that P(T_true = t_i | X = x) = p(X = x | T_true = t_i) . P(T_true = t_i) / p(X = x). Under the assumption that all t_i are equally likely, simplifying the derivative of the expectation yields \n\nO(x) = sum_{i=1}^N P(T_true = t_i | X = x) . c_i \n\nwhere c_i equals E[f(t_i) | D], i.e. the expected value of f(t_i) given that the training data is exactly D. Assuming the input noise to be Gaussian and Sigma, unless otherwise noted, to be diagonal, Sigma = (delta_ij sigma_i^2)_{1 <= i,j <= d}, the probability density of X under the assumption that T_true equals t_i is given by \n\np(X = x | T_true = t_i) = 1 / ((2 pi)^{d/2} |Sigma|^{1/2}) . exp[-(1/2) (x - t_i)^T Sigma^{-1} (x - t_i)] \n\nwhere |.| is the determinant. \n\n[1] This approach is related to Nowlan (1990) and MacKay (1991), but we emphasize the influence of different priors over the hypothesis space by giving preference to hypotheses that are closer to the domain theory. 
The optimal response to an input x can now be written as \n\nO(x) = ( sum_{i=1}^N exp[-(1/2) (x - t_i)^T Sigma^{-1} (x - t_i)] . c_i ) / ( sum_{i=1}^N exp[-(1/2) (x - t_i)^T Sigma^{-1} (x - t_i)] )   (1) \n\nEquation 1 corresponds to a network architecture with N Gaussian Basis Functions (GBFs) centered at t_k, k = 1, ..., N, each of which has a width sigma_i, i = 1, ..., d, along the i-th dimension, and an output weight c_k. This architecture is known to give smooth function approximations (Poggio et al., 1990; see also Platt, 1990), and the normalized response function (partitioning-to-one) was noted earlier in studies by Moody et al. (1988, 1989, 1990) to be beneficial to network performance. Carving up an input space into hyperquadrics (typically hyperellipsoids or just hyperspheres) in this way suffers in practice from the severe drawback that as soon as the dimensionality of the input is higher, it becomes less feasible to cover the whole space with units of only local relevance (\"curse of dimensionality\"). The normalized response function has an essentially space-filling effect, and fewer units have to be allocated while, at the same time, most of the locality properties can be preserved such that efficient ball tree data structures (Omohundro, 1991) can still be used. If the distances between the centers are large with respect to their widths, the nearest-neighbor rule is recovered. With decreasing distances, the output of the network changes more smoothly between the centers. \n\n2.2 TRAINING REGIMEN \n\nThe output weights c_i are given by \n\nc_i = E[f(t_i) | D] = int_{-inf}^{inf} z . p(f(t_i) = z | D) dz. \n\nBayes' Theorem states that p(f(t_i) = z | D) = p(D | f(t_i) = z) . p(f(t_i) = z) / p(D). Let M(i) denote the set of indices j of the training data points (x_j, y_j) that were generated by adding noise to (t_i, f(t_i)), i.e. the points that \"originated\" from t_i. Note that it is not known a priori which indices a set M(i) contains; only posterior probabilities can be given. 
By applying Bayes' Theorem and by assuming independence between different locations t_i, the coefficients c_i can be written as[2] \n\nc_i = ( int_{-inf}^{inf} z . prod_{m in M(i)} exp[-(1/2) (z - y_m)^2 / sigma_data^2] . exp[-(1/2) (z - b(t_i))^2 / sigma_prior^2] dz ) / ( int_{-inf}^{inf} prod_{m in M(i)} exp[-(1/2) (v - y_m)^2 / sigma_data^2] . exp[-(1/2) (v - b(t_i))^2 / sigma_prior^2] dv ). \n\nIt can be easily shown that this simplifies to \n\nc_i = ( sum_{m in M(i)} y_m + k . b(t_i) ) / ( |M(i)| + k )   (2) \n\nwhere k = sigma_data^2 / sigma_prior^2 and |.| denotes the cardinality operator. In accordance with intuition, the coefficients c_i turn out to be a weighted mean between the value suggested by the domain theory b and the training data values which originated from t_i. The weighting factor k / (|M(i)| + k) reflects the relative reliability of the two sources of information, the empirical data and the prior knowledge. \n\nDefine S_i as S_i = (c_i - b(t_i)) . k + sum_{m in M(i)} (c_i - y_m). Clearly, if |S_i| is minimized to 0, then c_i reaches exactly the optimal value as it is given by equation 2. An adaptive solution to this is to update c_i according to dc_i/dt = -gamma . S_i. Since the membership distribution for M(i) is not known a priori, we approximate it using a posterior estimate of the probability p(m in M(i) | x_m) that m is in M(i) given that x_m was generated by some center t_i, which is \n\np(m in M(i) | x_m) = p(X = x_m | T_true = t_i) / sum_{k=1}^N p(X = x_m | T_true = t_k). \n\np(X = x_m | T_true = t_i) is the activation act_i of the i-th center when the network is presented with input x_m. \n\n[2] The normalization constants of the Gaussians in numerator and denominator cancel, as well as the product over all m not in M(i) of the probabilities that (x_m, y_m) is in the data set. 
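As a concrete illustration, the normalized-response network of Equation 1 and the closed-form coefficients of Equation 2 can be written down in a few lines. This is a hedged sketch under our own naming (gbf_predict, coefficients, etc.); it is not the authors' implementation, and it assumes diagonal covariances as in the text:

```python
import numpy as np

# Hedged sketch of the partitioning-to-one GBF network (Eq. 1) and the
# weighted-mean output coefficients (Eq. 2). All names are illustrative.

def gbf_predict(x, centers, sigmas, c):
    """Normalized GBF response O(x) of Eq. 1, with diagonal covariance."""
    d2 = np.sum(((x - centers) / sigmas) ** 2, axis=1)  # squared distances
    act = np.exp(-0.5 * d2)                             # activations act_i
    return np.dot(act, c) / np.sum(act)                 # partition-to-one

def coefficients(y_by_center, b_at_centers, k):
    """Eq. 2: c_i mixes the data points that originated from t_i with the
    domain-theory value b(t_i); k = sigma_data^2 / sigma_prior^2."""
    return np.array([(np.sum(ys) + k * b) / (len(ys) + k)
                     for ys, b in zip(y_by_center, b_at_centers)])
```

For example, a center with two measurements y = 1 and 3 and theory value b(t_i) = 5 at k = 2 yields c_i = (4 + 10)/4 = 3.5: the estimate is pulled from the data mean 2 toward the prior, by exactly the weighting factor k/(|M(i)| + k).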
Substituting this estimate into the sum in S_i leads to the following training regimen: using stochastic sample-by-sample learning, we present in each training step with probability 1 - lambda a data point y_i, and with probability lambda a point b(t_k) that is generated from the domain theory, where lambda is given by \n\nlambda := k . N / (k . N + M)   (3) \n\n(Recall that M is the total number of data points, and N is the number of centers.) lambda varies from 0 (the data is far more reliable than the prior knowledge) to 1 (the data is unreliable in comparison with the prior knowledge). Thus, the change of c_i after each presentation is proportional to the error times the normalized activation of the i-th center, act_i / sum_{k=1}^N act_k. \n\nThe optimal positions for the centers t_i are not known in advance, and we therefore perform standard LMS gradient descent on t_i and on the widths sigma_i. The weight updates in a learning step are given by a discretization of the following dynamic equations (i = 1, ..., N; j = 1, ..., d): \n\ndt_ij/dt = 2 gamma . Delta . act_i . (c_i - O(x)) / (sum_{k=1}^N act_k) . (x_j - t_ij) / sigma_ij^2 \n\nd sigma_ij/dt = gamma . Delta . act_i . (c_i - O(x)) / (sum_{k=1}^N act_k) . (x_j - t_ij)^2 / sigma_ij^3 \n\nwhere Delta is the interpolation error, act_i is the (forward-computed) activity of the i-th center, and t_ij and x_j are the j-th components of t_i and x respectively. \n\n3 APPLICATION TO ROLLING MILL CONTROL \n\n3.1 THE PROBLEM \n\nIn integrated steelworks, the finishing train of the hot line rolling mill transforms preprocessed steel from a casting successively into a homogeneously rolled steel-plate. Controlling this process is a notoriously hard problem: the underlying physical principles are only roughly known. 
The values of the control parameters depend on a large number of entities, and have to be determined from measurements that are very noisy, strongly clustered, \"expensive,\" and scarce.[3] On the other hand, reliability and precision are at a premium. Unreasonable predictions have to be avoided under any circumstances, even in regions where no training data is available, and, by contract, an extremely high precision is required: the rolling tolerance has to be guaranteed to be less than typically 20 micrometers, which is substantial, particularly in the light of the fact that the steel construction that holds the rolls itself expands for several millimeters under a rolling pressure of typically several thousands of tons. The considerable economic interest in improving adaptation methods in rolling mills derives from the fact that lower rolling tolerances are indispensable for the supplied industry, yet it has proven difficult to remain operational within the guaranteed bounds under these constraints. \n\nThe control problem consists of determining a reduction schedule that specifies for each pair of rolls their initial distance such that after the final roll pair the desired thickness of the steel-plate (the actual feedback) is achieved. This reinforcement problem can be reduced to a less complex approximation problem of predicting the rolling force that is created at each pair of rolls, since this force can be directly and precisely correlated to the reduction in thickness at a roll pair by conventional means. Our task was therefore to predict the rolling force on the basis of nine input variables such as temperature and rolling speed, such that a subsequent conventional high-precision control can quickly reach the guaranteed rolling tolerance before much of a plate is lost. 
\n\nThe state-of-the-art solution to this problem is a parameterized analytic model that considers nine physical entities as input and makes use of a huge number of tabulated coefficients that are adapted separately for each material and each thickness class. The solution is known to give only approximate predictions of the actual force, and although the on-line corrections by the high-precision control are generally sufficient to reach the rolling tolerance, this process necessarily takes more time the worse the prediction is, resulting in a waste of more of the beginning of a steel-plate. Furthermore, any improvement in the adaptation techniques will also shorten the initialization process for a rolling mill, which currently takes several months because of the poor generalization abilities of the applied method to other thickness classes or steel qualities. \n\nThe data for our simulations was drawn from a rolling mill that was being installed at the time of our experiments. It included measurements for around 200 different steel qualities; only a few qualities were represented more than 100 times. \n\n[3] The costs for a single sheet of metal, giving three useful data points that have to be measured under difficult conditions, amount to a six-digit dollar sum. Only a limited number of plates of the same steel quality is processed every week, causing the data scarcity. \n\n3.2 EXPERIMENTAL RESULTS \n\nAccording to the results in Section 2, a network of the specified localized architecture was trained with data (artificially) generated from the domain theory and data derived from on-line measurements. 
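The mixing of measured and theory-generated training points described in Section 2.2 can be sketched as follows. This is a minimal illustration with our own names (sample_training_point, rng, b); the actual sampling machinery of the application is not published:

```python
import numpy as np

# Hedged sketch of the stochastic training regimen's data mixing: with
# probability lam (Eq. 3) present a point generated from the domain theory
# b at one of the centers, otherwise a measured pair. Names are illustrative.

def mixing_rate(k, n_centers, n_data):
    """Eq. 3: lam = k*N / (k*N + M)."""
    return k * n_centers / (k * n_centers + n_data)

def sample_training_point(rng, lam, data_x, data_y, centers, b):
    """Draw one training example for sample-by-sample learning."""
    if rng.random() < lam:                       # theory-generated point
        t = centers[rng.integers(len(centers))]
        return t, b(t)
    j = rng.integers(len(data_x))                # measured point
    return data_x[j], data_y[j]
```

When k*N is four times the number of measurements M, mixing_rate reproduces the paper's operating point lambda = 0.8, i.e. four theory presentations for every measured one on average.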
The remaining design considerations for architecture selection were based on the extent to which a network had the capacity to represent an instantiation of the analytic model (our domain theory). \n\nTable 1 shows the approximation error of partitioning-to-one architectures with different degrees of freedom on their centers' widths. The variances of the GBFs were either all equal and not adapted (GBFs with constant widths), or adapted individually for all centers (GBFs with spherical adaptation), or adapted individually for all centers and every input dimension, leading to axially oriented hyperellipsoids (GBFs with ellipsoidal adaptation). \n\nMethod | Normalized Error Squares [10^-2] | Maximum Error [10^-2] \nGBFs with partitioning, constant widths | 0.40 | 2.1 \nGBFs with partitioning, spherical adaptation | 0.18 | 1.7 \nGBFs with partitioning, ellipsoidal adaptation | 0.096 | 0.41 \nGBFs, no partitioning | 0.85 | 5.3 \nMLP | 0.38 | 3.4 \n\nTable 1: Approximation of an instantiation of the domain theory: localized architectures (GBFs) and a network with sigmoidal hidden units (MLP). \n\nNetworks with \"full hyperquadric\" GBFs, for which the covariance matrix is no longer diagonal, were also tested, but performed clearly worse, apparently due to too many degrees of freedom. The table shows that the networks with \"ellipsoidal\" GBFs performed best. Convergence time of this type of network was also found to be superior. The table also gives the comparative numbers for two other architectures: GBFs without normalized response function achieved significantly lower accuracy (performance is given for a net with 81 centers), even though they had far more centers than those with partitioning and only 16 centers. 
Using up to 200 million sample presentations, sigmoidal networks trained with standard backpropagation (Rumelhart et al., 1986) achieved a still lower level, despite the use of weight-elimination (Le Cun, 1990) and an analysis of the data's eigenvalue spectrum to optimize the learning rate (see also Le Cun, 1991). The indicated numbers are for networks with optimized numbers of hidden units. \n\nThe value for lambda was determined according to equation 3 in Section 2.2 as lambda = 0.8; the noise in our application could be easily estimated, since multiple measurements are available for each input point and the reliability of the domain theory is known. Applying the described training regimen to the GBF architecture with ellipsoidal adaptation led to promising results. \n\nFigure 1 shows the points in a \"slice\" through a specific point in the input space: the measurements, and the force as it is predicted by the analytic model and by the network. It can be seen that the net exhibits fail-safe behavior: it sticks closely to the analytic model in regions where no data is available. If data points are available and suggest a different force, then the network modifies its output in the direction of the data. \n\nFigure 1: Prediction of the rolling force by the state-of-the-art model, by the neural network, and the measured data points as a function of the inputs 'sheet thickness' and 'temperature.' \n\nMethod | Improvement on Trained Samples [%] | Improvement at Generalization [%] \nGaussian units, lambda = 0.8 | 18 | 16 \nGaussian units, lambda = 0.4 | 41 | 14 \nMLP | 3.9 | 3.1 \n\nTable 2: Relative improvement of the neural network solutions with respect to the state-of-the-art model: on the training data and on the cross-validation set. 
\n\nTable 2 shows to what extent the neural network method performed superior to \nthe currently applied state-of-the-art model (cross-validated mean). The numbers \nindicate the relative improvement of the mean squared error of the network solution \nwith respect to an optimally-tuned analytic model. Although the data set was very \nsparse and noisy, it was nevertheless still possible to give a better prediction. The \neffect is also shown if a different value for A were chosen: the higher value of A, that \nis, more prior knowledge, keeps the net from memorizing the data, and improves \ngeneralization slightly. In case of the sigmoidal network, A was simply optimized \nto give the smallest cross-validation error. When trained without prior knowledge, \nnone of the architectures lead to an improvement. \n\n4 CONCLUSION \n\nIn a large-scale applications to devise a neural control for a hot line rolling mill, \ntraining data turned out to be insufficient for learning to be feasible that is only \nbased on syntactic preference biases. By using a Bayesian framework, an imperfect \ndomain theory was incorporated as an inductive bias in a principled way. The \nmethod outperformed the state-of-the-art solution to an extent which steelworks \nautomation experts consider highly convincing. \n\n\f666 \n\nRoscheisen, Hofmann, and Tresp \n\nAcknowledgements \n\nThis paper describes the first two authors' joint university project, which was supported \nby grants from Siemens AG, Corporate R&D, and Studienstiftung des deutschen Volkes. \nH. Rein and F. Schmid of the Erlangen steelworks automation group helped identify the \nproblem and sampled the data. W. Buttner and W. Finnoff made valuable suggestions. \n\nReferences \n\nBergadano, F. and A. Giordana (1990). Guiding Induction with Domain Theories. In: Y. \n\nKodratoff et al. (eds.), Machine Learning, Vol. 3, Morgan Kaufmann. \n\nCun, Y. Le, J. S. Denker, and S. A. Solla (1990). Optimal Brain Damage. In: D. S. 
Touretzky (ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann. \n\nCun, Y. Le, I. Kanter, and S. A. Solla (1991). Second Order Properties of Error Surfaces: Learning Time and Generalization. In: R. P. Lippmann et al. (eds.), Advances in Neural Information Processing Systems 3, Morgan Kaufmann. \n\nDarken, Ch. and J. Moody (1990). Fast adaptive k-means clustering: some empirical results. In: Proceedings of the IJCNN, San Diego. \n\nDuda, R. O. and P. E. Hart (1973). Pattern Classification and Scene Analysis. NY: Wiley. \n\nMacKay, D. (1991). Bayesian Modeling. Ph.D. thesis, Caltech. \n\nMinton, S. N., J. G. Carbonell et al. (1989). Explanation-based Learning: A problem-solving perspective. Artificial Intelligence, Vol. 40, pp. 63-118. \n\nMitchell, T. M., R. M. Keller, and S. T. Kedar-Cabelli (1986). Explanation-based Learning: A unifying view. Machine Learning, Vol. 1, pp. 47-80. \n\nMoody, J. (1990). Fast Learning in Multi-Resolution Hierarchies. In: D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann, pp. 29-39. \n\nMoody, J. and Ch. Darken (1989). Fast Learning in Networks of Locally-tuned Processing Units. Neural Computation, Vol. 1, pp. 281-294, MIT. \n\nMoody, J. and Ch. Darken (1988). Learning with Localized Receptive Fields. In: D. Touretzky et al. (eds.), Proc. of the Connectionist Models Summer School, Morgan Kaufmann. \n\nNowlan, St. J. (1990). Maximum Likelihood Competitive Learning. In: D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann. \n\nOmohundro, S. M. (1991). Bump Trees for Efficient Function, Constraint, and Classification Learning. In: R. P. Lippmann et al. (eds.), Advances in Neural Information Processing Systems 3, Morgan Kaufmann. \n\nPlatt, J. (1990). A Resource-Allocating Network for Function Interpolation. In: D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann. \n\nPoggio, T. 
and F. Girosi (1990). A Theory of Networks for Approximation and Learning. A.I. Memo No. 1140 (extended in No. 1167 and No. 1253), MIT. \n\nRoscheisen, M., R. Hofmann, and V. Tresp (1992). Incorporating Domain-Specific Prior Knowledge into Networks of Locally-Tuned Units. In: S. Hanson et al. (eds.), Computational Learning Theory and Natural Learning Systems, MIT Press. \n\nRumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning representations by back-propagating errors. Nature, 323(9):533-536, October. \n\nRumelhart, D. E. (1988). Plenary Address, IJCNN, San Diego. \n\nScott, G. M., J. W. Shavlik, and W. H. Ray (1991). Refining PID Controllers using Neural Networks. Technical Report, submitted to Neural Computation. \n\nTecuci, G. and Y. Kodratoff (1990). Apprenticeship Learning in Imperfect Domain Theories. In: Y. Kodratoff et al. (eds.), Machine Learning, Vol. 3, Morgan Kaufmann. \n\nWeigend, A. (1991). Connectionist Architectures for Time-Series Prediction of Dynamical Systems. Ph.D. thesis, Stanford. \n", "award": [], "sourceid": 447, "authors": [{"given_name": "Martin", "family_name": "R\u00f6scheisen", "institution": null}, {"given_name": "Reimar", "family_name": "Hofmann", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}