{"title": "Incorporating Second-Order Functional Knowledge for Better Option Pricing", "book": "Advances in Neural Information Processing Systems", "page_first": 472, "page_last": 478, "abstract": null, "full_text": "Incorporating Second-Order Functional Knowledge for Better Option Pricing\n\nCharles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau*, René Garcia\n\nCIRANO, Montreal, Qc, Canada H3A 2A5\n\n{dugas,bengioy,belislfr,nadeauc}@iro.umontreal.ca\n\ngarciar@cirano.qc.ca\n\nAbstract\n\nIncorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in two of its arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but that (1) has those properties, and (2) is a universal approximator of continuous functions with these and other properties. We apply this new class of functions to the task of modeling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints.\n\n1 Introduction\n\nIncorporating a priori knowledge of a particular task into a learning algorithm helps reduce the necessary complexity of the learner and generally improves performance, if the incorporated knowledge is relevant to the task and really corresponds to the generating process of the data. In this paper we consider prior knowledge on the positivity of some first and second derivatives of the function to be learned. In particular, such constraints have applications to modeling the price of European stock options.
Based on the Black-Scholes formula, the price of a call stock option is monotonically increasing in both the \"moneyness\" and the time to maturity of the option, and it is convex in the \"moneyness\". Section 3 explains these terms and stock options in more detail. For a function f(x_1, x_2) of two real-valued arguments, this corresponds to the following properties:\n\n∂f/∂x_1 ≥ 0,  ∂f/∂x_2 ≥ 0,  ∂²f/∂x_1² ≥ 0  (1)\n\nThe mathematical results of this paper (section 2) are the following: first, we introduce a class of one-argument functions (similar to neural networks) that is positive, non-decreasing and convex in its argument, and we show that this class of functions is a universal approximator for positive functions with positive first and second derivatives. Second, in the main theorem, we extend this result to functions of two or more arguments, with some having the convexity property and all having a positive first derivative. This result rests on additional properties of the cross-derivatives, which we illustrate below for the case of two arguments:\n\n∂²f/∂x_1∂x_2 ≥ 0,  ∂³f/∂x_1²∂x_2 ≥ 0  (2)\n\nComparative experiments on these new classes of functions were performed on stock option prices, showing some improvements when using these new classes rather than ordinary feedforward neural networks. The improvements appear to be non-stationary, but the new class of functions shows the most stable behavior in predicting future prices. The detailed results are presented in section 5.\n\n*C.N. is now with Health Canada, at Claude_Nadeau@hc-sc.gc.ca
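The sign constraints (1) and (2) can be checked numerically on the Black-Scholes call price itself. The sketch below is illustrative only (it is not part of the paper); the rate and volatility values are arbitrary assumptions:

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, tau, r=0.05, sigma=0.2):
    # Black-Scholes price of a European call with spot S, strike K,
    # time to maturity tau, risk-free rate r and volatility sigma.
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)

# Central finite differences at an arbitrary point: the price should be
# increasing in the spot and in the time to maturity, and convex in the spot.
h = 1e-3
S, K, tau = 100.0, 95.0, 0.5
dS   = (bs_call(S + h, K, tau) - bs_call(S - h, K, tau)) / (2 * h)
dtau = (bs_call(S, K, tau + h) - bs_call(S, K, tau - h)) / (2 * h)
d2S  = (bs_call(S + h, K, tau) - 2 * bs_call(S, K, tau) + bs_call(S - h, K, tau)) / h ** 2
assert dS > 0 and dtau > 0 and d2S > 0
```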
\n\n2 Theory\n\nDefinition. A class of functions F̂ from ℝⁿ to ℝ is a universal approximator for a class of functions F from ℝⁿ to ℝ if for any f ∈ F, any compact domain D ⊂ ℝⁿ, and any positive ε, one can find an f̂ ∈ F̂ with sup_{x∈D} |f(x) − f̂(x)| ≤ ε.\n\nIt has already been shown that the class of artificial neural networks with one hidden layer\n\nN = {f(x) = b_0 + Σ_{i=1}^{H} w_i h(b_i + Σ_j v_{ij} x_j)}  (3)\n\ne.g. with a sigmoid activation function h(s) = 1/(1 + e^{−s}), is a universal approximator of continuous functions [1, 2, 5]. The number of hidden units H of the neural network is a hyper-parameter that controls the accuracy of the approximation, and it should be chosen to balance the trade-off between accuracy (bias of the class of functions) and variance (due to the finite sample used to estimate the parameters of the model); see also [6].\n\nSince h is monotonically increasing, it is easy to force the first derivatives with respect to x to be positive by forcing the weights to be positive, for example with the exponential function:\n\nN₊ = {f(x) = b_0 + Σ_{i=1}^{H} e^{w_i} h(b_i + Σ_j e^{v_{ij}} x_j)}  (4)\n\nbecause h'(s) = h(s)(1 − h(s)) > 0.\n\nSince the sigmoid h has a positive first derivative, its primitive, which we call softplus, is convex:\n\nζ(s) = log(1 + e^s)  (5)\n\ni.e., dζ(s)/ds = h(s) = 1/(1 + e^{−s}). The basic idea of the proposed class of functions cN₊₊ is to replace the sigmoid of a sum by a product of softplus or sigmoid functions over each of the dimensions (using the softplus over the convex dimensions and the sigmoid over the others):\n\ncN₊₊ = {f(x) = e^{b_0} + Σ_{i=1}^{H} e^{w_i} (Π_{j=1}^{c} ζ(b_{ij} + e^{v_{ij}} x_j)) (Π_{j=c+1}^{n} h(b_{ij} + e^{v_{ij}} x_j))}  (6)\n\nOne can readily check that the first derivatives w.r.t. x_j are positive, and that the second derivatives w.r.t. x_j for j ≤ c are positive. However, this class of functions has other properties.
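Before turning to those additional properties, here is a minimal numerical sketch (not from the paper) of the class in eq. 6, with all weights made positive through exponentiation; the random parameter values are arbitrary:

```python
import math
import random

def softplus(s):
    # ζ(s) = log(1 + e^s), written in an overflow-safe form.
    return math.log1p(math.exp(-abs(s))) + max(s, 0.0)

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def cnpp(x, b0, w, b, v, c):
    # f(x) = e^{b0} + Σ_i e^{w_i} · Π_{j<=c} ζ(b_ij + e^{v_ij} x_j)
    #                            · Π_{j>c}  h(b_ij + e^{v_ij} x_j)   (eq. 6)
    out = math.exp(b0)
    for i in range(len(w)):
        term = math.exp(w[i])
        for j in range(len(x)):
            pre = b[i][j] + math.exp(v[i][j]) * x[j]
            term *= softplus(pre) if j < c else sigmoid(pre)
        out += term
    return out

# Random parameters: by construction the function is positive,
# non-decreasing in both inputs, and convex in x_1 (here c = 1, n = 2).
random.seed(0)
H, n, c = 3, 2, 1
b0 = random.gauss(0, 1)
w = [random.gauss(0, 1) for _ in range(H)]
b = [[random.gauss(0, 1) for _ in range(n)] for _ in range(H)]
v = [[random.gauss(0, 1) for _ in range(n)] for _ in range(H)]

f = lambda x1, x2: cnpp((x1, x2), b0, w, b, v, c)
eps = 1e-4
assert f(0.3, 0.7) > 0
assert f(0.3 + eps, 0.7) > f(0.3, 0.7)                          # increasing in x_1
assert f(0.3, 0.7 + eps) > f(0.3, 0.7)                          # increasing in x_2
assert f(0.3 + eps, 0.7) + f(0.3 - eps, 0.7) > 2 * f(0.3, 0.7)  # convex in x_1
```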
Let (j_1, ..., j_m) be a set of indices with 1 ≤ j_i ≤ c (convex dimensions), and let (j'_1, ..., j'_p) be a set of indices with c + 1 ≤ j'_i ≤ n (the other dimensions); then\n\n∂^{m+p} f / (∂x_{j_1} ··· ∂x_{j_m} ∂x_{j'_1} ··· ∂x_{j'_p}) ≥ 0,  ∂^{2m+p} f / (∂x_{j_1}² ··· ∂x_{j_m}² ∂x_{j'_1} ··· ∂x_{j'_p}) ≥ 0  (7)\n\nNote that m or p can be 0, so as special cases we find that f is positive, that it is monotonically increasing w.r.t. all its inputs, and that it is convex w.r.t. the first c inputs.\n\n2.1 Universality of cN₊₊ over ℝⁿ\n\nTheorem. Within the set F₊₊ of continuous functions from ℝⁿ to ℝ whose first and second derivatives are non-negative (as specified by equation 7), the class cN₊₊ is a universal approximator.\n\nProof. For lack of space we only show here a sketch of the proof, and only for the case n = 2 and c = 1 (one convex dimension and one other dimension), but the same principle allows one to prove the more general case. Let f(x) ∈ F₊₊ be the function to approximate with a function g ∈ cN₊₊. To perform our approximation we will restrict g to the subset of cN₊₊ where the sigmoid becomes a step function θ(x) = 1_{x>0} and where the softplus becomes the positive part function x₊ = max(0, x). Let D be the compact domain of interest and ε the desired approximation precision. We focus our attention on an axis-aligned rectangle T with lower-left corner (a_1, b_1) and upper-right corner (a_2, b_2) such that it is the smallest such rectangle enclosing D and it can be partitioned into squares of side length L forming a grid such that the value of f at neighboring grid points does not differ by more than ε. The number of grid squares along the x_1 axis is N_1 and the number along the x_2 axis is N_2. The number of hidden units is H = (N_1 + 1)(N_2 + 1). Let x_{ij} = (x_i, x_j) = (a_1 + iL, b_1 + jL) be the grid points, with i = 0, 1, ..., N_1 and j = 0, 1, ..., N_2. Also, x = (x_1, x_2).
With k = i(N_2 + 1) + j, we recursively build a series of functions g_k(x) as follows:\n\ng_k(x) = g_{k−1}(x) + Δ_{ij} (x_1 − x_{i−1})₊ θ(x_2 − x_{j−1})\n\nwith increment\n\nΔ_{ij} = e^{w_k}\n\nfor k = 1 to H and with initial approximation g_0 = f(a_1, b_1). The final approximation is g(x) = g_H(x). It is exact at every single point on the grid and within ε of the true function value anywhere within D. To prove this, we need to show that at every step of the recursive procedure, the necessary increment is nonnegative (since it must be equated with e^{w_k}). First note that the value of g_H(x_{ij}) is affected only by the set of increments Δ_{st} for which s ≤ i and t ≤ j, so that\n\nf(x_{ij}) = g_H(x_{ij}) = Σ_{s=0}^{i} Σ_{t=0}^{j} Δ_{st} (i − s + 1)L\n\nIsolating Δ_{ij} and doing some algebra, we get\n\nΔ_{ij} = Δ³_{x_1,x_1,x_2} g_H(x_{ij}) L²\n\nwhere Δ³_{x_i,x_j,x_k} is the third-degree finite difference with respect to arguments x_i, x_j, x_k, i.e. Δ³_{x_1,x_1,x_2} f(x_1, x_2) = (Δ²_{x_1,x_2} f(x_1, x_2) − Δ²_{x_1,x_2} f(x_1 − L, x_2))/L, where similarly Δ²_{x_1,x_2} f(x_1, x_2) = (Δ_{x_1} f(x_1, x_2) − Δ_{x_1} f(x_1, x_2 − L))/L, and Δ_{x_1} f(x_1, x_2) = (f(x_1, x_2) − f(x_1 − L, x_2))/L. By the mean value theorem, the third-degree finite difference is nonnegative if the corresponding third derivative is nonnegative everywhere over the finite interval, which is obtained from constraint 7. Finally, the third-degree finite difference being nonnegative, the corresponding increment is also nonnegative, and this completes the proof.\n\nCorollary. Within the set of positive continuous functions from ℝ to ℝ whose first and second derivatives are non-negative, the class cN₊₊ is a universal approximator.\n\n3 Estimating Call Option Prices\n\nAn option is a contract between two parties that entitles the buyer to a claim at a future date T that depends on the future price S_T of an underlying asset whose price at time t is S_t. In this paper we consider the very common European call options, in which the value of the claim at maturity (time T) is max(0, S_T − K), i.e.
if the price is above the strike price K, then the seller of the option owes S_T − K dollars to the buyer. In the no-arbitrage framework, the call function is believed to be a function of the actual market price of the security (S_t), the strike price (K), the remaining time to maturity (τ = T − t), the risk-free interest rate (r), and the volatility of the return (σ). The challenge is to evaluate the value of the option prior to the expiration date, before entering a transaction. The risk-free interest rate (r) needs to be somehow extracted from the term structure and the volatility (σ) needs to be forecast, this latter task being a field of research in itself. We have previously tried [3] to feed neural networks with estimates of the volatility using historical averages, but so far the gains have remained insignificant. We therefore drop these two features and rely on the ones that can be observed: S_t, K, τ. One more important result is that, under mild conditions, the call option function is homogeneous of degree one with respect to the strike price, so our final approximation depends on two variables: the moneyness (M = S_t/K) and the time to maturity (τ):\n\nc_t/K = f(M, τ)  (8)\n\nThe economic theory leading to the Black-Scholes formula suggests that f has the properties of (1), so we will evaluate the advantages brought by the function classes of the previous section. However, it is not clear whether the constraints on the cross-derivatives that are incorporated in cN₊₊ should or should not be present in the true price function. It is known that the Black-Scholes formula does not adequately represent the market pricing of options, but it might still be a useful guide in designing a learning algorithm for option prices.\n\n4 Experimental Setup\n\nAs a reference model, we use a simple multi-layered perceptron with one hidden layer (eq. 3).
We also compare our results with a recently proposed model [4] that closely resembles the Black-Scholes formula for option pricing (i.e. another way to incorporate possibly useful prior knowledge):\n\nŷ = α + M · Σ_{i=1}^{n_h} β_{1,i} h(γ_{i,0} + γ_{i,1} M + γ_{i,2} τ) + e^{−rτ} · Σ_{i=1}^{n_h} β_{2,i} h(γ_{i,3} + γ_{i,4} M + γ_{i,5} τ)  (9)\n\nWe evaluate two new architectures incorporating some or all of the constraints defined in equation 7.\n\nWe used European call option data from 1988 to 1993. A total of 43518 transaction prices on European call options on the S&P500 index were used. In section 5, we report results on 1988 data. In each case, we used the first two quarters of 1988 as a training set (3434 examples), the third quarter as a validation set (1642 examples) for model selection, and quarters 4 to 20 as test sets (each with around 1500 examples) for final generalization error estimation. In tables 1 and 2, we present results for networks with unconstrained weights on the left-hand side, and weights constrained to positive and monotone functions through exponentiation of parameters on the right-hand side. For each model, the number of hidden units varies from one to nine. The mean squared error results reported were obtained as follows: first, we randomly sampled the parameter space 1000 times. We picked the best (lowest training error) model and trained it for up to 1000 more iterations. Repeating this procedure 10 times, we selected and averaged the performance of the best of these 10 models (those with training error no more than 10% worse than the best out of 10). In figure 1, we present tests of the same models on each quarter up to and including 1993 (20 additional test sets) in order to assess the persistence (conversely, the degradation through time) of the trained models.
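The selection protocol just described can be sketched as follows. This is a paraphrase with toy stand-ins, not the authors' code; `train_error`, `sample_params` and `refine` are hypothetical callbacks standing in for actual network training:

```python
import random

def select_model(train_error, sample_params, refine,
                 n_samples=1000, n_repeats=10, tol=0.10):
    # Sample the parameter space, keep the lowest-training-error draw,
    # refine it further, repeat the whole procedure, then keep the repeats
    # whose training error is within `tol` of the best repeat.
    runs = []
    for _ in range(n_repeats):
        start = min((sample_params() for _ in range(n_samples)), key=train_error)
        refined = refine(start)  # stands in for further training
        runs.append((train_error(refined), refined))
    best = min(err for err, _ in runs)
    return [p for err, p in runs if err <= best * (1 + tol)]

# Toy usage: "training error" is a 1-d quadratic, "refinement" a small step.
random.seed(1)
err = lambda p: (p - 3.0) ** 2 + 0.5
draw = lambda: random.uniform(-10, 10)
step = lambda p: p + 0.1 * (3.0 - p)
kept = select_model(err, draw, step, n_samples=200, n_repeats=5)
assert kept and all(err(p) <= 1.0 for p in kept)
```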
\n\n5 Forecasting Results\n\nSimple Multi-Layered Perceptrons\nMean Squared Error Results on Call Option Pricing (×10⁻⁴)\nUnconstrained weights | Constrained weights\nUnits (1-9) | Train Valid Test1 Test2 | Train Valid Test1 Test2\n2.38 3.60 3.81 1.68 1.40 3.79 1.42 3.70 3.64 1.40 1.41 3.81 3.71 1.41 1.41 3.80 1.40 3.67\n2.67 6.06 5.70 2.63 27.31 2.63 2.65 27.32 2.67 30.56 2.63 33.12 2.65 33.49 2.63 39.72 2.66 38.07\n3.02 3.08 3.07 3.05 3.03 3.08 3.05 3.07 3.04\n2.32 2.14 2.15 2.24 2.29 2.14 2.23 2.14 2.27\n2.73 1.51 1.27 1.25 1.27 1.24 1.26 1.24 1.24\n1.92 1.76 1.39 1.44 1.38 1.43 1.41 1.43 1.41\n\nBlack-Scholes Similar Networks\nMean Squared Error Results on Call Option Pricing (×10⁻⁴)\nUnconstrained weights | Constrained weights\nUnits (1-9) | Train Valid Test1 Test2 | Train Valid Test1 Test2\n1.54 1.42 1.40 1.40 1.40 1.41 1.40 1.40 1.42\n4.70 2.49 24.53 1.90 1.88 30.83 31.43 1.85 30.82 1.87 35.77 1.89 1.87 35.97 34.68 1.86 32.65 1.92\n2.17 1.71 1.73 1.70 1.70 1.70 1.72 1.69 1.73\n3.61 3.19 3.72 3.15 3.51 3.19 3.12 3.25 3.17\n2.78 2.05 2.00 1.96 2.01 2.04 1.98 1.98 2.08\n1.58 1.42 1.41 1.39 1.40 1.42 1.40 1.40 1.43\n1.40 1.27 1.24 1.27 1.25 1.25 1.25 1.25 1.26\n\nTable 1: Left: the parameters are free to take on negative values. Right: the parameters are constrained through exponentiation so that the resulting function is both positive and monotone increasing everywhere w.r.t. both inputs. Top: regular feedforward artificial neural networks. Bottom: neural networks with an architecture resembling the Black-Scholes formula, as defined in equation 9.
The number of units varies from 1 to 9 for each network architecture. The first two quarters of 1988 were used for training, the third quarter of 1988 for validation, and the fourth quarter of 1988 for testing. The first quarter of 1989 was used as a second test set to assess the persistence of the models through time (figure 1). In bold: test results for the models with the best validation results.\n\nAs can be seen in tables 1 and 2, the positivity constraints imposed through exponentiation of the weights allow the networks to avoid overfitting. The training errors are generally slightly lower for the networks with unconstrained weights and the validation errors are similar, but the final test errors are disastrous for the unconstrained networks compared to the constrained ones. This \"liftoff\" pattern in the training, validation and testing errors drew our attention to the evolution of the test error through time. The unconstrained networks obtain better training, validation and testing (test 1) results but fail in\n\nProducts of SoftPlus and Sigmoid Functions\nMean Squared Error Results on Call Option Pricing (×10⁻⁴)\nUnits (1-9) | Train Valid Test1 Test2 | Train Valid Test1 Test2\n3.51 2.27 1.61 3.48 3.48 1.51 4.19 1.46 4.18 1.57 4.09 1.51 1.62 4.10 4.25 1.55 1.46 4.12\n2.28 3.27 14.24 2.28 18.16 2.28 1.84 20.14 1.83 10.03 22.47 1.85 1.86 7.78 1.84 11.58 26.13 1.87\n2.14 2.13 2.13 1.54 1.56 1.57 1.55 1.55 1.60\n2.37 2.37 2.36 1.97 1.95 1.97 2.00 1.96 1.97\n2.15 1.58 1.53 1.51 1.57 1.53 1.67 1.54 1.47\n2.35 1.58 1.38 1.29 1.46 1.35 1.46 1.44 1.31\n\nSums of SoftPlus and Sigmoid Functions\nMean Squared Error Results on Call Option Pricing (×10⁻⁴)\nUnconstrained
weights | Constrained weights\nUnits (1-9) | Train Valid Test1 Test2 | Train Valid Test1 Test2\n3.43 1.83 3.39 1.42 4.11 1.45 4.09 1.56 4.21 1.60 4.12 1.57 3.94 1.61 4.25 1.64 4.25 1.65\n4.10 2.30 25.00 2.29 1.84 35.00 21.80 1.85 10.11 1.85 14.99 1.86 1.86 8.00 1.85 7.89 6.16 1.84\n2.19 2.19 1.58 1.56 1.52 1.54 1.60 1.54 1.54\n2.36 2.34 1.95 1.99 2.00 2.00 1.98 1.98 1.97\n1.59 1.45 1.46 1.69 1.69 1.66 1.67 1.72 1.70\n1.93 1.26 1.32 1.33 1.42 1.39 1.48 1.48 1.52\n\nTable 2: Results similar to those in table 1, but for the two new architectures. Top: products of softplus along the convex axis with sigmoid along the monotone axis. Bottom: the softplus and sigmoid functions are summed instead of being multiplied. Top right: the fully constrained proposed architecture.\n\nthe extra testing set (test 2). Constrained architectures seem more robust to changes in the underlying econometric conditions. The constrained Black-Scholes-similar model performs slightly better than the other models on the second test set but then fails on later quarters (figure 1). All in all, at the expense of slightly higher initial errors, our proposed architecture allows us to forecast with increased stability much farther into the future. This is a very welcome property, as new derivative products have a tendency to lock in values for much longer durations (up to 10 years) than traditional ones.\n\n6 Conclusions\n\nMotivated by prior knowledge on the derivatives of the function that gives the price of European options, we have introduced new classes of functions, similar to multi-layer neural networks, that have those properties.
We have shown one of these classes to be a universal approximator for functions having those properties, and we have shown that using this a priori knowledge can help improve generalization performance. In particular, we have found that the models that incorporate this a priori knowledge generalize in a more stable way over time.\n\nFigure 1: Out-of-sample results from the third quarter of 1988 to the fourth quarter of 1993 (incl.) for the models with the best validation results; the x-axis is the quarter used as a test set. Left: unconstrained models: results for the Black-Scholes-similar network. The other unconstrained models exhibit similar swinging result patterns and levels of errors. Right: constrained models: the fully constrained proposed architecture (solid). The model with sums over dimensions obtains similar results. The regular neural network (dotted). The constrained Black-Scholes model obtains very poor results (dashed).\n\nReferences\n\n[1] G. Cybenko. Continuous valued neural networks with two hidden layers are sufficient. Technical report, Department of Computer Science, Tufts University, Medford, MA, 1988.\n\n[2] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303-314, 1989.\n\n[3] C. Dugas, O. Bardou, and Y. Bengio. Analyses empiriques sur des transactions d'options [Empirical analyses of option transactions]. Technical Report 1176, Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, Québec, Canada, 2000.\n\n[4] R. Garcia and R. Gençay. Pricing and Hedging Derivative Securities with Neural Networks and a Homogeneity Hint.
Technical Report 98s-35, CIRANO, Montréal, Québec, Canada, 1998.\n\n[5] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.\n\n[6] J. Moody. Prediction risk and architecture selection for neural networks. In From Statistics to Neural Networks: Theory and Pattern Recognition Applications. Springer, 1994.\n", "award": [], "sourceid": 1920, "authors": [{"given_name": "Charles", "family_name": "Dugas", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Fran\u00e7ois", "family_name": "B\u00e9lisle", "institution": null}, {"given_name": "Claude", "family_name": "Nadeau", "institution": null}, {"given_name": "Ren\u00e9", "family_name": "Garcia", "institution": null}]}