{"title": "Use of a Multi-Layer Perceptron to Predict Malignancy in Ovarian Tumors", "book": "Advances in Neural Information Processing Systems", "page_first": 978, "page_last": 984, "abstract": null, "full_text": "Reinforcement Learning for Continuous \n\nStochastic Control Problems \n\nRemi Munos \n\nCEMAGREF, LISC, Pare de Tourvoie, \nBP 121, 92185 Antony Cedex, FRANCE. \n\nRerni.Munos@cemagref.fr \n\nPaul Bourgine \n\nEcole Polyteclmique, CREA, \n\n91128 Palaiseau Cedex, FRANCE. \n\nBourgine@poly.polytechnique.fr \n\nAbstract \n\nThis paper is concerned with the problem of Reinforcement Learn(cid:173)\ning (RL) for continuous state space and time stocha.stic control \nproblems. We state the Harnilton-Jacobi-Bellman equation satis(cid:173)\nfied by the value function and use a Finite-Difference method for \ndesigning a convergent approximation scheme. Then we propose a \nRL algorithm based on this scheme and prove its convergence to \nthe optimal solution. \n\n1 \n\nIntroduction to RL in the continuous, stochastic case \n\nThe objective of RL is to find -thanks to a reinforcement signal- an optimal strategy \nfor solving a dynamical control problem. Here we sudy the continuous time, con(cid:173)\ntinuous state-space stochastic case, which covers a wide variety of control problems \nincluding target, viability, optimization problems (see [FS93], [KP95])}or which a \nformalism is the following. The evolution of the current state x(t) E 0 (the state(cid:173)\nspace, with 0 open subset of IRd ), depends on the control u(t) E U (compact subset) \nby a stochastic differential equation, called the state dynamics: \n\ndx = f(x(t), u(t))dt + a(x(t), u(t))dw \n\n(1) \nwhere f is the local drift and a .dw (with w a brownian motion of dimension rand \n(j a d x r-matrix) the stochastic part (which appears for several reasons such as lake \nof precision, noisy influence, random fluctuations) of the diffusion process. 
\nFor initial state x and control u(t), (1) leads to an infinity of possible traj~tories \nx(t). For some trajectory x(t) (see figure I)., let T be its exit time from 0 (with \nthe convention that if x(t) always stays in 0, then T = 00). Then, we define the \nfunctional J of initial state x and control u(.) as the expectation for all trajectories \nof the discounted cumulative reinforcement : \n\nJ(x; u(.)) = Ex,u( .) {loT '/r(x(t), u(t))dt +,,{ R(X(T))} \n\n\f1030 \n\nR. Munos and P. Bourgine \n\nwhere rex, u) is the running reinforcement and R(x) the boundary reinforcement. \n'Y is the discount factor (0 :S 'Y < 1). In the following, we assume that J, a are of \nclass C2 , rand Rare Lipschitzian (with constants Lr and LR) and the boundary \n80 is C2 . \n\n\u00b7 all \u00b7 \n\n\u2022 \n\nII \n\n\u2022 \n\n\u2022 \n\n\u2022 \n\n\u2022 \n\n\u2022 \n\n\u2022 \n\n\u2022 \n\n\u2022 \nxirJ \n\u2022 \n\n\u2022 \n\nFigure 1: The state space, the discretized ~6 (the square dots) and its frontier 8~6 \n(the round ones). A trajectory Xk(t) goes through the neighbourhood of state ~. \n\nRL uses the method of Dynamic Program~ing (DP) which generates an optimal \n(feed-back) control u*(x) by estimating the value function (VF), defined as the \nmaximal value of the functional J as a function of initial state x : \n\nVex) = sup J(x; u(.). \n\nu( .) \n\n(2) \n\nIn the RL approach, the state dynamics is unknown from the system ; the only \navailable information for learning the optimal control is the reinforcement obtained \nat the current state. Here we propose a model-based algorithm, i.e. that learns \non-line a model of the dynamics and approximates the value function by successive \niterations. \n\nSection 2 states the Hamilton-Jacobi-Bellman equation and use a Finite-Difference \n(FD) method derived from Kushner [Kus90] for generating a convergent approxi(cid:173)\nmation scheme. 
In section 3, we propose an RL algorithm based on this scheme and prove its convergence to the VF in appendix A.

2 A Finite Difference scheme

Here, we state a second-order nonlinear differential equation (obtained from the DP principle, see [FS93]) satisfied by the value function, called the Hamilton-Jacobi-Bellman equation.

Let the d × d matrix a = σ.σ' (with ' denoting matrix transposition). We consider the uniformly parabolic case, i.e. we assume that there exists c > 0 such that ∀x ∈ O, ∀u ∈ U, ∀y ∈ R^d, Σ_{i,j=1}^d a_ij(x, u).y_i.y_j ≥ c.‖y‖². Then V is C² (see [Kry80]). Let V_x be the gradient of V and V_{x_i x_j} its second-order partial derivatives.

Theorem 1 (Hamilton-Jacobi-Bellman) The following HJB equation holds:

    V(x).ln γ + sup_{u∈U} [ r(x, u) + V_x(x).f(x, u) + (1/2) Σ_{i,j=1}^d a_ij(x, u).V_{x_i x_j}(x) ] = 0   for x ∈ O

Besides, V satisfies the following boundary condition: V(x) = R(x) for x ∈ ∂O.

Remark 1 The challenge of learning the VF is motivated by the fact that from V, we can deduce the following optimal feed-back control policy:

    u*(x) ∈ arg sup_{u∈U} [ r(x, u) + V_x(x).f(x, u) + (1/2) Σ_{i,j=1}^d a_ij(x, u).V_{x_i x_j}(x) ]

In the following, we assume that O is bounded. Let e_1, ..., e_d be a basis for R^d. Let the positive and negative parts of a function φ be: φ+ = max(φ, 0) and φ− = max(−φ, 0). For any discretization step δ, let us consider the lattices: δZ^d = {δ.Σ_{i=1}^d j_i.e_i} where j_1, ..., j_d are any integers, and Σ^δ = δZ^d ∩ O. Let the frontier ∂Σ^δ of Σ^δ denote the set of points {ξ ∈ δZ^d \ O such that at least one adjacent point ξ ± δe_i ∈ Σ^δ} (see figure 1).

Let U^δ ⊂ U be a finite control set that approximates U in the sense: δ ≤ δ' ⇒ U^{δ'} ⊂ U^δ and ∪_δ U^δ = U.
Besides, we assume that for i = 1..d,

    a_ii(x, u) − Σ_{j≠i} |a_ij(x, u)| ≥ 0.                                      (3)

By replacing the gradient V_x(ξ) by the forward and backward first-order finite-difference quotients:

    Δ±_{x_i} V(ξ) = ±(1/δ).[V(ξ ± δe_i) − V(ξ)]

and V_{x_i x_j}(ξ) by the second-order finite-difference quotients:

    Δ_{x_i x_i} V(ξ)  = (1/δ²).[V(ξ + δe_i) + V(ξ − δe_i) − 2V(ξ)]
    Δ±_{x_i x_j} V(ξ) = (1/2δ²).[V(ξ + δe_i ± δe_j) + V(ξ − δe_i ∓ δe_j)
                        − V(ξ + δe_i) − V(ξ − δe_i) − V(ξ + δe_j) − V(ξ − δe_j) + 2V(ξ)]

in the HJB equation, we obtain the following: for ξ ∈ Σ^δ,

    V^δ(ξ).ln γ + sup_{u∈U^δ} { r(ξ, u) + Σ_{i=1}^d [ f_i^+(ξ, u).Δ+_{x_i} V^δ(ξ) − f_i^−(ξ, u).Δ−_{x_i} V^δ(ξ)
        + (a_ii(ξ, u)/2).Δ_{x_i x_i} V^δ(ξ)
        + Σ_{j≠i} ( (a_ij^+(ξ, u)/2).Δ+_{x_i x_j} V^δ(ξ) − (a_ij^−(ξ, u)/2).Δ−_{x_i x_j} V^δ(ξ) ) ] } = 0

Knowing that (Δt.ln γ) is an approximation of (γ^{Δt} − 1) as Δt tends to 0, we deduce:

    V^δ(ξ) = sup_{u∈U^δ} [ γ^{τ(ξ,u)} Σ_{ζ∈Σ^δ} p(ξ, u, ζ).V^δ(ζ) + τ(ξ, u).r(ξ, u) ]            (4)

with

    τ(ξ, u) = δ² / Σ_{i=1}^d [ δ.|f_i(ξ, u)| + a_ii(ξ, u) ]                                     (5)

which appears as a DP equation for some finite Markovian Decision Process (see [Ber87]) whose state space is Σ^δ and whose probabilities of transition are:

    p(ξ, u, ξ ± δe_i)        = τ(ξ,u)/(2δ²) . [ 2δ.f_i^±(ξ, u) + a_ii(ξ, u) − Σ_{j≠i} |a_ij(ξ, u)| ],
    p(ξ, u, ξ + δe_i ± δe_j) = τ(ξ,u)/(2δ²) . a_ij^±(ξ, u)   for i ≠ j,
    p(ξ, u, ξ − δe_i ± δe_j) = τ(ξ,u)/(2δ²) . a_ij^∓(ξ, u)   for i ≠ j,
    p(ξ, u, ζ)               = 0 otherwise.                                                     (6)

Thanks to a contraction property due to the discount factor γ, there exists a unique solution (the fixed point) V^δ to equation (4) for ξ ∈ Σ^δ with the boundary condition V^δ(ξ) = R(ξ) for ξ ∈ ∂Σ^δ. The following theorem (see [Kus90] or [FS93]) insures that V^δ is a convergent approximation scheme.

Theorem 2 (Convergence of the FD scheme) V^δ converges to V as δ ↓ 0:

    lim_{δ↓0, ξ→x} V^δ(ξ) = V(x)   uniformly on O

Remark 2 Condition (3) insures that the p(ξ, u, ζ) are positive. If this condition does not hold, several possibilities to overcome this are described in [Kus90].
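The transition probabilities (6) and the interpolation time (5) can be written down directly. The sketch below is an illustrative assumption-laden reading of the scheme: the normalisation τ = δ² / Σ_i (δ|f_i| + a_ii) is taken to be what makes the probabilities sum to one, and the example drift and diffusion values are invented for the check.

```python
def fd_transitions(f, a, delta):
    """Transition probabilities of the finite-difference MDP (eqs. (5)-(6))
    at one (state, control) pair.

    f : drift vector of length d.
    a : symmetric diffusion matrix sigma.sigma' (d x d), assumed to satisfy
        condition (3): a[i][i] >= sum_{j != i} |a[i][j]|.
    Returns (tau, p), where p maps integer neighbour offsets (in units of
    delta along the basis directions) to probabilities."""
    d = len(f)

    def pos(v):                       # positive part, phi+ = max(phi, 0)
        return max(v, 0.0)

    # tau chosen so that the probabilities below sum to one (assumption)
    tau = delta ** 2 / sum(delta * abs(f[i]) + a[i][i] for i in range(d))
    c = tau / (2.0 * delta ** 2)
    p = {}

    def add(offset, prob):
        if prob != 0.0:
            p[offset] = p.get(offset, 0.0) + prob

    for i in range(d):
        row = sum(abs(a[i][j]) for j in range(d) if j != i)
        for s in (+1, -1):
            # axis neighbours xi ± delta.e_i
            e = tuple(s if k == i else 0 for k in range(d))
            add(e, c * (2.0 * delta * pos(s * f[i]) + a[i][i] - row))
            # diagonal neighbours: xi + delta.e_i ± delta.e_j gets a_ij^±,
            # xi - delta.e_i ± delta.e_j gets a_ij^∓
            for j in range(d):
                if j != i:
                    plus = tuple((1 if k == i else 0) + (s if k == j else 0)
                                 for k in range(d))
                    add(plus, c * pos(s * a[i][j]))
                    minus = tuple((-1 if k == i else 0) + (s if k == j else 0)
                                  for k in range(d))
                    add(minus, c * pos(-s * a[i][j]))
    return tau, p

# Illustrative 2-D example satisfying condition (3).
tau, p = fd_transitions([1.0, -0.5], [[1.0, 0.2], [0.2, 0.5]], 0.1)
```

The dictionary p, together with tau, defines one row of the finite MDP of equation (4); positivity of every entry is exactly what condition (3) guarantees.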
\n\n3 The reinforcement learning algorithm \n\nHere we assume that f is bounded from below. As the state dynami,:s (J and a) \nis unknown from the system, we approximate it by building a model f and a from \nsamples of trajectories Xk(t) : we consider series of successive states Xk = Xk(tk) \nand Yk = Xk(tk + Tk) such that: \n- \"It E [tk, tk + Tk], \nkN.8 for some positive constant kN, \n- the control u is constant for t E [tk, tk + Tk], \n- T k satisfies for some positive kl and k2, \n\nx(t) E N(~) neighbourhood of ~ whose diameter is inferior to \n\nThen incrementally update the model : \n\n.1 \",n Yk - Xk \nn ~k=l Tk \n\nan(~,u) \n\nn \n\n1 \n-;;; Lk=l \n\n(Yk - Xk - Tk.fn(~, u)) (Yk - Xk - Tk\u00b7fn(~, u))' \n\nTk \n\n(7) \n\n(8) \n\nand compute the approximated time T( x, u) ~d the approximated probabilities of \ntransition p(~, u, () by replacing f and a by f and a in (5) and (6). \nWe obtain the following updating rule of the V D -value of state ~ : \n\nV~+l (~) = sUPuEU/) [,~/:(x,u) L( p(~, u, ()V~(() + T(x, u)r(~, u)] \n\n(9) \n\nwhich can be used as an off-line (synchronous, Gauss-Seidel, asynchronous) or on(cid:173)\ntime (for example by updating V~(~) as soon as a trajectory exits from the neigh(cid:173)\nbourood of ~) DP algorithm (see [BBS95]). \nBesides, when a trajectory hits the boundary [JO at some exit point Xk(T) then \nupdate the closest state ~ E [JED with: \n\n(10) \n\nTheorem 3 (Convergence of the algorithm) Suppose that the model as well \nas the V D -value of every state ~ E :ED and control u E UD are regularly updated \n(respectively with (8) and (9)) and that every state ~ E [JED are updated with (10) \nat least once. Then \"Ie> 0, :3~ such that \"18 ~ ~, :3N, \"In 2: N, \n\nsUP~EE/) IV~(~) - V(~)I ~ e with probability 1 \n\n\fReinforcement Learningfor Continuous Stochastic Control Problems \n\n1033 \n\n4 Conclusion \n\nThis paper presents a model-based RL algorithm for continuous stochastic control \nproblems. 
A model of the dynamics is approximated by the mean and the covariance of the successive states. Then an RL updating rule based on a convergent FD scheme is deduced and, under the hypothesis of an adequate exploration, convergence to the optimal solution is proved as the discretization step δ tends to 0 and the number of iterations tends to infinity. This result is to be compared to the model-free RL algorithm for the deterministic case in [Mun97]. An interesting possible future work would be to consider model-free algorithms in the stochastic case, for which a Q-learning rule (see [Wat89]) could be relevant.

A Appendix: proof of the convergence

Let M_f, M_a, M_{f_x} and M_{a_x} be the upper bounds of f, a, f_x and a_x, and m_f the lower bound of f. Let E^δ = sup_{ξ∈Σ^δ} |V^δ(ξ) − V(ξ)| and E^δ_n = sup_{ξ∈Σ^δ} |V^δ_n(ξ) − V^δ(ξ)|.

A.1 Estimation error of the model f_n and a_n and the probabilities p_n

Suppose that the trajectory x_k(t) occurred for some occurrence w_k(t) of the Brownian motion: x_k(t) = x_k + ∫_{t_k}^t f(x_k(t), u) dt + ∫_{t_k}^t σ(x_k(t), u) dw_k. Then we consider a trajectory z_k(t) starting from ξ at t_k and following the same Brownian motion: z_k(t) = ξ + ∫_{t_k}^t f(z_k(t), u) dt + ∫_{t_k}^t σ(z_k(t), u) dw_k. Let z_k = z_k(t_k + τ_k). Then

    (y_k − x_k) − (z_k − ξ) = ∫_{t_k}^{t_k+τ_k} [f(x_k(t), u) − f(z_k(t), u)] dt + ∫_{t_k}^{t_k+τ_k} [σ(x_k(t), u) − σ(z_k(t), u)] dw_k.

Thus, from the C¹ property of f and σ,

    ‖(y_k − x_k) − (z_k − ξ)‖ ≤ (M_{f_x} + M_{a_x}).k_N.τ_k.δ.                  (11)

The diffusion process has the following property (see for example the Itô-Taylor majoration in [KP95]): E_x[z_k] = ξ + τ_k.f(ξ, u) + o(τ_k) which, from (7), is equivalent to: E_x[(z_k − ξ)/τ_k] = f(ξ, u) + O(δ). Thus, from the law of large numbers and (11):

    limsup_{n→∞} ‖f_n(ξ, u) − f(ξ, u)‖ ≤ limsup_{n→∞} ‖(1/n) Σ_{k=1}^n [(y_k − x_k)/τ_k − (z_k − ξ)/τ_k]‖ + O(δ)
        ≤ (M_{f_x} + M_{a_x}).k_N.δ + O(δ) = O(δ)   w.p.
1                                                                               (12)

Let r_k = z_k − ξ − τ_k.f(ξ, u) and r̃_k = y_k − x_k − τ_k.f_n(ξ, u), which satisfy (from (11) and (12)):

    ‖r_k − r̃_k‖ = (M_{f_x} + M_{a_x}).τ_k.k_N.δ + τ_k.o(δ)                      (13)

Besides, diffusion processes have the following property (again see [KP95]):

    E_x[(z_k − ξ).(z_k − ξ)'] = a(ξ, u).τ_k + f(ξ, u).f(ξ, u)'.τ_k² + o(τ_k²)

which, from (7), is equivalent to:

    E_x[ (z_k − ξ − τ_k.f(ξ, u)).(z_k − ξ − τ_k.f(ξ, u))'/τ_k ] = a(ξ, u) + o(δ²).

From the definition of a_n(ξ, u), we have: a_n(ξ, u) − a(ξ, u) = (1/n) Σ_{k=1}^n r̃_k.r̃_k'/τ_k − E_x[r_k.r_k'/τ_k] + o(δ²), and from the law of large numbers, (12) and (13), we have:

    limsup_{n→∞} ‖a_n(ξ, u) − a(ξ, u)‖ = limsup_{n→∞} ‖(1/n) Σ_{k=1}^n (r̃_k.r̃_k' − r_k.r_k')/τ_k‖ + o(δ²)
        ≤ limsup_{n→∞} ‖r̃_k − r_k‖.( ‖r̃_k‖ + ‖r_k‖ )/τ_k + o(δ²) = o(δ²)

Thus, for some positive constants k_f and k_a,

    ‖f_n(ξ, u) − f(ξ, u)‖ ≤ k_f.δ    w.p. 1
    ‖a_n(ξ, u) − a(ξ, u)‖ ≤ k_a.δ²   w.p. 1                                     (14)

Besides, from (5) and (14), we have:

    |1/τ(ξ, u) − 1/τ_n(ξ, u)| ≤ d.(k_f.δ² + d.k_a.δ²)/(d.m_f.δ)², hence |τ(ξ, u) − τ_n(ξ, u)| ≤ k_τ.δ²      (15)

and from a property of the exponential function,

    |γ^{τ(ξ,u)} − γ^{τ_n(ξ,u)}| ≤ k_τ.ln(1/γ).δ².                               (16)

We can deduce from (14) that, for some positive constant k_p,

    limsup_{n→∞} |p(ξ, u, ζ) − p_n(ξ, u, ζ)| ≤ k_p.δ   w.p. 1                   (17)

A.2 Estimation of |V^δ_{n+1}(ξ) − V^δ(ξ)|

After having updated V^δ_n(ξ) with rule (9), let A denote the difference |V^δ_{n+1}(ξ) − V^δ(ξ)|. From (4) and (9),

    A ≤ | γ^{τ(ξ,u)} Σ_ζ [p(ξ, u, ζ) − p_n(ξ, u, ζ)].V^δ(ζ)
        + (γ^{τ(ξ,u)} − γ^{τ_n(ξ,u)}) Σ_ζ p_n(ξ, u, ζ).V^δ(ζ)
        + γ^{τ_n(ξ,u)} Σ_ζ p_n(ξ, u, ζ).[V^δ(ζ) − V^δ_n(ζ)]
        + Σ_ζ p_n(ξ, u, ζ).τ(ξ, u).[r(ξ, u) − r_n(ξ, u)]
        + Σ_ζ p_n(ξ, u, ζ).[τ(ξ, u) − τ_n(ξ, u)].r(ξ, u) |   for all u ∈ U^δ

As V is differentiable we have: V(ζ) = V(ξ) + V_x.(ζ − ξ) + o(‖ζ − ξ‖).
Let us define a linear function Ṽ such that: Ṽ(x) = V(ξ) + V_x.(x − ξ). Then we have:

    [p(ξ, u, ζ) − p_n(ξ, u, ζ)].V^δ(ζ) = [p(ξ, u, ζ) − p_n(ξ, u, ζ)].[V^δ(ζ) − V(ζ)] + [p(ξ, u, ζ) − p_n(ξ, u, ζ)].V(ζ),

thus:

    Σ_ζ [p(ξ, u, ζ) − p_n(ξ, u, ζ)].V^δ(ζ) = k_p.E^δ.δ + Σ_ζ [p(ξ, u, ζ) − p_n(ξ, u, ζ)].[Ṽ(ζ) + o(δ)]
        = [Ṽ(η) − Ṽ(η̃)] + k_p.E^δ.δ + o(δ) = [Ṽ(η) − Ṽ(η̃)] + o(δ)

with η = Σ_ζ p(ξ, u, ζ).(ζ − ξ) and η̃ = Σ_ζ p_n(ξ, u, ζ).(ζ − ξ). Besides, from the convergence of the scheme (theorem 2), we have E^δ.δ = o(δ). From the linearity of Ṽ, |Ṽ(ζ) − Ṽ(z)| ≤ ‖ζ − z‖.M_{V_x} ≤ 2.k_p.δ². Thus |Σ_ζ [p(ξ, u, ζ) − p_n(ξ, u, ζ)].V^δ(ζ)| = o(δ), and from (15), (16) and the Lipschitz property of r,

    A = | γ^{τ_n(ξ,u)} Σ_ζ p_n(ξ, u, ζ).[V^δ(ζ) − V^δ_n(ζ)] | + o(δ).

As γ^{τ_n(ξ,u)} ≤ 1 − (τ_n(ξ,u)/2).ln(1/γ), there is a positive constant k (depending on ln(1/γ), d, M_f and M_a) such that γ^{τ_n(ξ,u)} ≤ 1 − k.δ, and we have:

    A ≤ (1 − k.δ).E^δ_n + o(δ)                                                  (18)

A.3 A sufficient condition for sup_{ξ∈Σ^δ} |V^δ_n(ξ) − V^δ(ξ)| ≤ ε_2

Let us suppose that for all ξ ∈ Σ^δ, the following conditions hold for some α > 0:

    E^δ_n > ε_2 ⇒ |V^δ_{n+1}(ξ) − V^δ(ξ)| ≤ E^δ_n − α                           (19)
    E^δ_n ≤ ε_2 ⇒ |V^δ_{n+1}(ξ) − V^δ(ξ)| ≤ ε_2                                 (20)

From the hypothesis that all states ξ ∈ Σ^δ are regularly updated, there exists an integer m such that at stage n + m all the ξ ∈ Σ^δ have been updated at least once since stage n. Besides, since all ξ ∈ ∂Σ^δ are updated at least once with rule (10), we have, for all ξ ∈ ∂Σ^δ, |V^δ_n(ξ) − V^δ(ξ)| = |R(x_k(τ)) − R(ξ)| ≤ 2.L_R.δ ≤ ε_2 for any δ ≤ Δ_3 = ε_2/(2.L_R). Thus, from (19) and (20) we have:

    E^δ_n > ε_2 ⇒ E^δ_{n+m} ≤ E^δ_n − α
    E^δ_n ≤ ε_2 ⇒ E^δ_{n+m} ≤ ε_2

Thus there exists N such that: ∀n ≥ N, E^δ_n ≤ ε_2.

A.4 Convergence of the algorithm

Let us prove theorem 3. For any ε > 0, let us consider ε_1 > 0 and ε_2 > 0 such that ε_1 + ε_2 = ε.
Assume E^δ_n > ε_2. Then from (18), A = E^δ_n − k.δ.ε_2 + o(δ) ≤ E^δ_n − k.δ.ε_2/2 for δ ≤ Δ_3. Thus (19) holds for α = k.δ.ε_2/2. Suppose now that E^δ_n ≤ ε_2. From (18), A ≤ (1 − k.δ).ε_2 + o(δ) ≤ ε_2 for δ ≤ Δ_3, and condition (20) is true.

Thus for δ ≤ min{Δ_1, Δ_2, Δ_3}, the sufficient conditions (19) and (20) are satisfied. So there exists N such that for all n ≥ N, E^δ_n ≤ ε_2. Besides, from the convergence of the scheme (theorem 2), there exists Δ_0 s.t. ∀δ ≤ Δ_0, sup_{ξ∈Σ^δ} |V^δ(ξ) − V(ξ)| ≤ ε_1. Thus for δ ≤ min{Δ_0, Δ_1, Δ_2, Δ_3}, ∃N, ∀n ≥ N,

    sup_{ξ∈Σ^δ} |V^δ_n(ξ) − V(ξ)| ≤ sup_{ξ∈Σ^δ} |V^δ_n(ξ) − V^δ(ξ)| + sup_{ξ∈Σ^δ} |V^δ(ξ) − V(ξ)| ≤ ε_1 + ε_2 = ε.

References

[BBS95] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, (72):81-138, 1995.

[Ber87] Dimitri P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, 1987.

[FS93] Wendell H. Fleming and H. Mete Soner. Controlled Markov Processes and Viscosity Solutions. Applications of Mathematics. Springer-Verlag, 1993.

[KP95] Peter E. Kloeden and Eckhard Platen. Numerical Solutions of Stochastic Differential Equations. Springer-Verlag, 1995.

[Kry80] N.V. Krylov. Controlled Diffusion Processes. Springer-Verlag, New York, 1980.

[Kus90] Harold J. Kushner. Numerical methods for stochastic control problems in continuous time. SIAM J. Control and Optimization, 28:999-1048, 1990.

[Mun97] Remi Munos. A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. International Joint Conference on Artificial Intelligence, 1997.

[Wat89] Christopher J.C.H. Watkins. Learning from delayed reward. PhD thesis, Cambridge University, 1989.
\n\n[Wat89j \n\n\fUse of a Multi-Layer Percept ron to \n\nPredict Malignancy in Ovarian Tumors \n\nHerman Verrelst, \n\nDirk Timmerman \n\nYves Moreau and Joos Vandewalle \n\nDept. of Electrical Engineering \nKatholieke Universiteit Leuven \n\nKard. Mercierlaan 94 \n\nB-3000 Leuven, Belgium \n\nDept. of Obst. and Gynaec. \nUniversity Hospitals Leuven \n\nHerestraat 49 \n\nB-3000 Leuven, Belgium \n\nAbstract \n\nWe discuss the development of a Multi-Layer Percept ron neural \nnetwork classifier for use in preoperative differentiation between \nbenign and malignant ovarian tumors. As the Mean Squared clas(cid:173)\nsification Error is not sufficient to make correct and objective as(cid:173)\nsessments about the performance of the neural classifier, the con(cid:173)\ncepts of sensitivity and specificity are introduced and combined \nin Receiver Operating Characteristic curves. Based on objective \nobservations such as sonomorphologic criteria, color Doppler imag(cid:173)\ning and results from serum tumor markers, the neural network is \nable to make reliable predictions with a discriminating performance \ncomparable to that of experienced gynecologists. \n\n1 \n\nIntrod uction \n\nA reliable test for preoperative discrimination between benign and malignant ovar(cid:173)\nian tumors would be of considerable help to clinicians. It would assist them to select \npatients for whom minimally invasive surgery or conservative management suffices \nversus those for whom urgent referral to a gynecologic oncologist is needed. \nWe discuss the development of a neural network classifier/diagnostic tool. The neu(cid:173)\nral network was trained by supervised learning, based on data from 191 thoroughly \nexamined patients presenting with ovarian tumors of which 140 were benign and 51 \nmalignant. As inputs to the network we chose indicators that in recent studies have \nproven their high predictive value [1, 2, 3]. 
Moreover, we gave preference to those indicators that can be obtained in an objective way by any gynecologist. Some of these indicators have already been used in attempts to make one single protocol or decision algorithm [3, 4].

In order to make reliable assessments of the practical performance of the classifier, it is necessary to work with concepts other than the Mean Squared classification Error (MSE), which is traditionally used as a measure of goodness in the training of a neural network. We will introduce notions such as specificity and sensitivity and combine them into Receiver Operating Characteristic (ROC) curves. The use of ROC-curves is motivated by the fact that they are independent of the relative proportion of the various output classes in the sample population. This enables an objective validation of the performance of the classifier. We will also show how, in the training of the neural network, MSE optimization with gradient methods can be refined and/or replaced with the help of ROC-curves and simulated annealing techniques.

The paper is organized as follows. In Section 2 we give a brief description of the selected input features. In Section 3 we state some drawbacks of the MSE criterion and introduce the concepts of sensitivity, specificity and ROC-curves. Section 4 then deals with the technicalities of training the neural network. In Section 5 we show the results and compare them to human performance.

2 Data acquisition and feature selection

The data were derived from a study group of 191 consecutive patients who were referred to a single institution (University Hospitals Leuven, Belgium) from August 1994 to August 1996. Table 1 lists the different indicators which were considered, together with their mean values and standard deviations or with their relative presence in cases of benign and malignant tumors.
\n\nTable 1 \nDemographic \n\nIndicator \nAge \nPostmenopausal \n\nSerum marker CA 125 (log) \nCD! \nMorphologic \n\nBlood flow present \nAbdominal fluid \nBilateral mass \nUnilocular cyst \nMultiloc/solid cyst \nSmooth wall \nIrregular wall \nPapillations \n\nBenign Malignant \n49.3 \u00b1 16.0 58.3 \u00b1 14.3 \n\n40% \n\n70.6% \n\n2.8\u00b11.1 \n\n5.2 \u00b1 1.9 \n\n72.9% \n12.1% \n11.4% \n42.1% \n16.4% \n58.6% \n32.1% \n7.9% \n\n100% \n52.9% \n35.3% \n5.9% \n49.0% \n2.0% \n76.5% \n74.5% \n\nTable 1: Demographic, serum marker, color Doppler imaging and morphologic indicators. \nFor the continuous valued features the mean and standard deviation for each class are \nreported. For binary valued indicators, the last two columns give the presence of the \nfeature in both classes e.g. only 2% of malignant tumors had smooth walls. \n\nFirst, all patients were scanned with ultrasonography to obtain detailed gray-scale \nimages of the tumors. Every tumor was extensively examined for its morphologic \ncharacteristics. Table 1 lists the selected morphologic features: presence of ab(cid:173)\ndominal fluid collection, papillary structures (> 3mm), smooth internal walls, wall \nirregularities, whether the cysts were unilocular, multilocular-solid and/or present \non both pelvic sides. All outcomes are binary valued: every observation relates to \nthe presence (1) or absence (0) of these characteristics. \n\nSecondly, all tumors were entirely surveyed by color Doppler imaging to detect \npresence or absence of blood flow within the septa, cyst walls, solid tumor areas or \novarian tissue. The outcome is also binary valued (1/0). \n\n\f980 \n\nH. Verrelst, Y. Moreau, 1. Vandewalle and D. TImmennan \n\nThirdly, in 173 out of the total of 191 patients, serum CA 125 levels were measured, \nusing CA 125 II immunoradiometric assays (Centocor, Malvern, PA). The CA 125 \nantigen is a glycoprotein that is expressed by most epithelial ovarian cancers. 
The numerical value gives the concentration in U/ml. Because almost all values were situated in a small interval between 0 and 100, and because a small portion took values up to 30,000, this variable was rescaled by taking its logarithm.

Since the age and menopausal status of the patient are considered to be highly relevant, these are also included. The menopausal score is -1 for premenopausal and +1 for postmenopausal patients. A third class of patients were assigned a 0 value: these patients had had a hysterectomy, so no menopausal status could be appointed to them. It is beyond the scope of this paper to give a complete account of the meaning of the different features that are used or the way in which the data were acquired. We will limit ourselves to this short description and refer the reader to [2, 3] and gynecological textbooks for a more detailed explanation.

3 Receiver Operating Characteristics

3.1 Drawbacks of the Mean Squared classification Error

Let us assume that we use a one-hidden-layer feed-forward NN with m inputs x_k^i, n_h hidden neurons with tanh(.) as the activation function, and one output y_k:

    y_k(θ) = Σ_{j=1}^{n_h} w_j tanh( Σ_{i=1}^m v_ij x_k^i + β_j )               (1)

parameterized by the vector θ consisting of the network's weights w_j and v_ij and bias terms β_j. The cost function is often chosen to be the squared difference between the desired response d_k and the actual response y_k, averaged over all N samples [12]:

    J(θ) = (1/N) Σ_{k=1}^N (d_k − y_k(θ))²                                      (2)

This type of cost function is continuous and differentiable, so it can be used in gradient-based optimization techniques such as steepest descent (back-propagation), quasi-Newton or Levenberg-Marquardt methods [8, 9, 11, 12]. However, there are some drawbacks to the use of this type of cost function.

First of all, the MSE is heavily dependent on the relative proportion of the different output classes in the training set.
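Equations (1) and (2) translate directly into code. The tiny imbalanced example at the end (90% benign samples, all network weights zero so the output is 0 everywhere) illustrates the dependence on class proportions: the MSE stays at only 0.1 even though every malignant case is missed. All numbers here are illustrative assumptions, not data from the study.

```python
import math

def mlp_output(x, w, v, beta):
    """One-hidden-layer MLP of eq. (1): y = sum_j w_j tanh(sum_i v_ij x_i + beta_j)."""
    m, n_h = len(x), len(w)
    return sum(w[j] * math.tanh(sum(v[i][j] * x[i] for i in range(m)) + beta[j])
               for j in range(n_h))

def mse(samples, w, v, beta):
    """Mean squared classification error of eq. (2) over (input, desired) pairs."""
    return sum((d - mlp_output(x, w, v, beta)) ** 2 for x, d in samples) / len(samples)

# 9 benign (d = 0) and 1 malignant (d = 1) samples; an all-zero network
# outputs 0 for every input, missing the malignant case, yet MSE = 1/10.
samples = [([0.0], 0.0)] * 9 + [([0.0], 1.0)]
imbalanced_mse = mse(samples, w=[0.0], v=[[0.0]], beta=[0.0])
```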
\nIn our dichotomic case this \ncan easily be demonstrated by writing the cost function, with superscripts b and m \nrespectively meaning benign and malignant, as \n\nJ(O) \n\n-\n\nNb \n\nNb + Nm Nb wk=l \n\n1 \"\"Nb (db \n\nk - Yk + Nb + N m N m wk=l \n\n1 \"\"Nm (dm \nk \n\n)2 \n\nNm \n\n)2 \n\n- Yk \n\n(3) \n\n~ ~ \n\n,\\ \n\n(1-,\\) \n\nIf the relative proportion in the sample population is not representative for reality, \nthe .x parameter should be adjusted accordingly. In practice this real proportion is \noften not known accurately or one simply ignores the meaning of .x and uses it as a \ndesign parameter in order to bias the accuracy towards one of the output classes. \n\nA second drawback of the MSE cost function is that it is not very in(cid:173)\nformative towards practical usage of the classification tool. A clinician is \nnot interested in the averaged deviation from desired numbers, but thinks in terms \nof percentages found, missed or misclassified. In the next section we will introduce \nthe concepts of sensitivity and specificity to express these more practical measures. \n\n\fUse of a MLP to Predict Malignancy in Ovarian Tumors \n\n981 \n\n3.2 Sensitivity, specificity and ROC-curves \n\nIf we take the desired response to be 0 for benign and 1 for malignant cases, the \nway to make clear cut (dichotomic) decisions is to compare the numerical outcome \nof the neural network to a certain threshold value T between 0 and 1. When the \noutcome is above the threshold T, the prediction is said to be positive. Otherwise the \nprediction is said to be negative. With this convention, we say that the prediction \nwas \n\nTrue Positive (TP) \nTrue Negative (TN) \nFalse Positive (FP) \nFalse Negative (FN) \n\nif the prediction was positive when the sample was malignant. \nif the prediction was negative when the sample was benign. \nif the prediction was positive when the sample was benign. \nif the prediction was negative when the sample was malignant. 
\n\nTo every of the just defined terms T P, TN, F P and F N, a certain subregion of the \ntotal sample space can be associated, as depicted in Figure 1. In the same sense, \nwe can associate to them a certain number counting the samples in each subregion. \nWe can then define sensitivity as Tl:FN' the proportion of malignant cases that \n\nTotal opulation \n\n\u00a5~li~nant \n.... \n. \" ... \n, \n\\ , , \n\" \n\nTP \n\n- '-\n\nFigure 1: The concepts of true and false positive and negative illustrated. The dashed \narea indicates the malignant cases in the total sample population. The positive prediction \nof an imperfect classification (dotted area) does not fully coincide with this sub area. \n\nare predicted to be malignant and specificity as F::r N' the proportion of benign \n\ncases that are predicted to be benign. The false positive rate is I-specificity. \nWhen varying the threshold T, the values of T P, TN, F P, F N and therefore \nalso sensitivity and specificity, will change. A low threshold will detect almost all \nmalignant cases at the cost of many false positives. A high threshold will give \nless false positives, but will also detect less malignant cases. Receiver Operating \nCharacteristic (ROC) curves are a way to visualize this relationship. The plot gives \nthe sensitivity versus false positive rate for varying thresholds T (e.g. Figure 2). \nThe ROC-curve is useful and widely used device for assessing and comparing the \nvalue of tests [5, 7]. The proportion of the whole area of the graph which lies below \nthe ROC-curve is a one-value measure of the accuracy of a test [6]. The higher \nthis proportion, the better the test. Figure 2 shows the ROC-curves for two simple \nclassifiers that use only one single indicator. (Which means that we classify a tumor \nbeing malignant when the value of the indicator rises above a certain value.) 
It is seen that the CA 125 level has high predictive power, as its ROC-curve spans 87.5% of the total area (Figure 2, left). For the age parameter, the ROC-curve spans only 65.6% (Figure 2, right). As indicated by the horizontal line in the plot, a CA 125 level classification will only misclassify 15% of all benign cases to reach an 80% sensitivity, whereas using only age, one would then misclassify up to 50% of them.

Figure 2: The Receiver Operating Characteristic (ROC) curve is the plot of the sensitivity versus the false positive rate of a classifier for the varying thresholds used. Only single indicators (left: CA 125, right: age) are used for these ROC-curves. The horizontal line marks the 80% sensitivity level.

Since for every set of parameters of the neural network the area under the ROC-curve can be calculated numerically, this one-value measure can also be used for supervised training, as will be shown in the next Section.

4 Simulation results

4.1 Inputs and architecture

The continuous inputs were standardized by subtracting their mean and dividing by their standard deviation (both calculated over the entire population). Binary-valued inputs were left unchanged. The desired outputs were labeled 0 for benign examples and 1 for malignant cases. The data set was split up: 2/3 of both benign and malignant samples were randomly selected to form the training set. The remaining examples formed the test set. The ratio of benign to all examples is λ ≈ 3/4.

Since the training set is not large, there is a risk of overtraining when too many parameters are used. We will limit the number of hidden neurons to n_h = 3 or 5.
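The numerical ROC-area computation mentioned above, a sweep over increasing thresholds followed by the trapezoidal integration rule as described for training in Section 4.2, can be sketched as follows (the scores, labels and sweep length are illustrative assumptions):

```python
def roc_area(scores, labels, n_thresholds=1000):
    """Area under the ROC curve: sensitivity and false positive rate are
    computed for increasing thresholds T in [0, 1], and the area is then
    obtained with the trapezoidal integration rule.
    labels: 1 = malignant, 0 = benign; both classes are assumed present."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pts = []
    for k in range(n_thresholds + 1):
        T = k / n_thresholds
        tp = sum(1 for s, l in zip(scores, labels) if s > T and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s > T and l == 0)
        pts.append((fp / n_neg, tp / n_pos))   # (false positive rate, sensitivity)
    pts.sort()
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A perfectly separating classifier yields an area of 1.0, while an uninformative one (identical scores for both classes) yields 0.5.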
\nAs the CA 125 level measurement is more expensive and time consuming, we will \ninvestigate two different classifiers: one which does use the CA 125 level and one \nwhich does not. The one-hidden-Iayer MLP architectures that are used, are 11-3-1 \nand 10-5-1. A tanh(.) is taken for the activation function in the hidden layer. \n\n4.2 Training \n\nA first way of training was MSE optimization using the cost function (3) . By taking \n>. = ~ in this expression, the role of malignant examples is more heavily weighted. \nThe parameter vector e was randomly initialized (zero mean Gaussian distribution, \nstandard deviation a = 0.01). Training was done using a quasi-Newton method with \nBFGS-update of the Hessian (fminu in Matlab) [8, 9]. To prevent overtraining, \nthe training was stopped before the MSE on the test set started to rise. Only few \niterations (~ 100) were needed. \n\nA second way of training was through the use of the area spanned by the ROC-curve \nof the classifier and simulated annealing techniques [10]. The area-measure AROC \nwas numerically calculated for every set of trial parameters: first the sensitivity \nand false positive rate were calculated for 1000 increasing values of the threshold \nT between 0 and 1, which gave the ROC-curve; secondly the area AROC under the \ncurve was numerically calculated with the trapezoidal integration rule. \n\n\fUse of a MLP to Predict Malignancy in Ovarian Tumors \n\n983 \n\nWe used Boltzmann Simulated Annealing to maximize the ROC-area. At time \nk a trial parameter set of the neural network OHl is randomly generated in the \nneighborhood of the present set Ok (Gaussian distribution, a = O.OO~. The trial \nset 8H1 is always accepted if the area Af.2? 2: Afoc. If Af'?? < Ak OC, Ok+! is \naccepted if \n\nA r:g? - A r;oc \n\n( \ne \n\nROC \n\nAk \n\n)/T. \n\n> Q \n\nwith Q a uniformly distributed random variable E [0,1] and Te the temperature. 
As cooling schedule we took

T_e = 1/(100 + 10k),

so that the annealing was low-temperature and fast-cooling. The optimization was stopped before the ROC-area calculated for the test set started to decrease. Only a few hundred annealing epochs were allowed.

4.3 Results

Table 2 states the results for the different approaches. One can see that adding the CA 125 serum level clearly improves the classifier's performance. Without it, the ROC-curve spans about 96.5% of the total square area of the plot, whereas with the CA 125 indicator it spans almost 98%. Also, the two training methods are seen to give comparable results. Figure 3 shows the ROC-curve calculated for the total population for the 11-3-1 MLP case, trained with simulated annealing.

                    Training set   Test set   Total population
  10-5-1 MLP, MSE      96.7%        96.4%         96.5%
  10-5-1 MLP, SA       96.6%        96.2%         96.4%
  11-3-1 MLP, MSE      97.9%        97.4%         97.7%
  11-3-1 MLP, SA       97.9%        97.5%         97.8%

Table 2: For the two architectures (10-5-1 and 11-3-1) of the MLP and for the gradient-based (MSE) and the simulated annealing (SA) optimization techniques, this table gives the resulting areas under the ROC-curves.

Figure 3: ROC-curve of the 11-3-1 MLP (with CA 125 level indicator), trained with simulated annealing. The curve, calculated for the total population, spans 97.8% of the total region.

All patients were examined by two gynecologists, who gave their subjective impressions and also classified the ovarian tumors into (probably) benign and malignant. Histopathological examinations of the tumors afterwards showed these gynecologists to have a sensitivity of up to 98% and false positive rates of 13% and 12% respectively.
As can be seen in Figure 3, the 11-3-1 MLP has a similar performance: for a sensitivity of 98%, its false positive rate is between 10% and 15%.

5 Conclusion

In this paper we have discussed the development of a Multi-Layer Perceptron neural network classifier for use in the preoperative differentiation between benign and malignant ovarian tumors. To assess performance and to train the classifiers, the concepts of sensitivity and specificity were introduced and combined in Receiver Operating Characteristic curves. Based on objective observations available to every gynecologist, the neural network is able to make reliable predictions with a discriminating performance comparable to that of experienced gynecologists.

Acknowledgments

This research work was carried out at the ESAT laboratory and the Interdisciplinary Center of Neural Networks (ICNN) of the Katholieke Universiteit Leuven, in the following frameworks: the Belgian Programme on Interuniversity Poles of Attraction, initiated by the Belgian State, Prime Minister's Office for Science, Technology and Culture (IUAP P4-02 and IUAP P4-24); a Concerted Action Project MIPS (Model-based Information Processing Systems) of the Flemish Community; and the FWO (Fund for Scientific Research - Flanders) project G.0262.97: Learning and Optimization: an Interdisciplinary Approach. The scientific responsibility rests with its authors.

References

[1] Bast R.C., Jr., Klug T.L., St. John E., et al., "A radioimmunoassay using a monoclonal antibody to monitor the course of epithelial ovarian cancer," N. Engl. J. Med., Vol. 309, pp.
883-888, 1983.

[2] Timmerman D., Bourne T., Tailor A., Van Assche F.A., Vergote I., "Preoperative differentiation between benign and malignant adnexal masses," submitted.

[3] Tailor A., Jurkovic D., Bourne T.H., Collins W.P., Campbell S., "Sonographic prediction of malignancy in adnexal masses using multivariate logistic regression analysis," Ultrasound Obstet. Gynaecol., in press, 1997.

[4] Jacobs I., Oram D., Fairbanks J., et al., "A risk of malignancy index incorporating CA 125, ultrasound and menopausal status for the accurate preoperative diagnosis of ovarian cancer," Br. J. Obstet. Gynaecol., Vol. 97, pp. 922-929, 1990.

[5] Hanley J.A., McNeil B., "A method of comparing the areas under receiver operating characteristic curves derived from the same cases," Radiology, Vol. 148, pp. 839-843, 1983.

[6] Swets J.A., "Measuring the accuracy of diagnostic systems," Science, Vol. 240, pp. 1285-1293, 1988.

[7] Galen R.S., Gambino S., Beyond Normality: The Predictive Value and Efficiency of Medical Diagnosis, John Wiley, New York, 1975.

[8] Gill P., Murray W., Wright M., Practical Optimization, Academic Press, New York, 1981.

[9] Fletcher R., Practical Methods of Optimization, 2nd ed., John Wiley, New York, 1987.

[10] Kirkpatrick S., Gelatt C.D., Vecchi M., "Optimization by simulated annealing," Science, Vol. 220, pp. 671-680, 1983.

[11] Rumelhart D.E., Hinton G.E., Williams R.J., "Learning representations by back-propagating errors," Nature, Vol. 323, pp. 533-536, 1986.
[12] Bishop C., Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.