{"title": "Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1369, "page_last": 1376, "abstract": "", "full_text": "Estimating Car Insurance Premia:\n\na Case Study in High-Dimensional Data\n\nInference\n\nNicolas Chapados, Yoshua Bengio, Pascal Vincent, Joumana\n\nGhosn, Charles Dugas, Ichiro Takeuchi, Linyan Meng\n\nUniversity of Montreal, dept. IRQ, CP 6128, Succ. Centre-Ville, Montreal, Qc, Canada, H3C3J7\n{chapadosJbengioy,vincentp,ghosnJdugas,takeuchi,mengl}~iro.umontreal.ca\n\nAbstract\n\nEstimating insurance premia from data is a difficult regression\nproblem for several reasons: the large number of variables, many of\nwhich are .discrete, and the very peculiar shape of the noise distri(cid:173)\nbution, asymmetric with fat tails, with a large majority zeros and a\nfew unreliable and very large values. We compare several machine\nlearning methods for estimating insurance premia, and test them\non a large data base of car insurance policies. We find that func(cid:173)\ntion approximation methods that do not optimize a squared loss,\nlike Support Vector Machines regression, do not work well in this\ncontext. Compared methods include decision trees and generalized\nlinear models. The best results are obtained with a mixture of\nexperts, which better identifies the least and most risky contracts,\nand allows to reduce the median premium by charging more to the\nmost risky customers.\n\n1\n\nIntroduction\n\nThe main mathematical problem faced by actuaries is that of estimating how much\neach insurance contract is expected to cost. This conditional expected claim amount\nis called the pure premium and it is the basis of the gross premium charged to the\ninsured. This expected value is conditionned on information available about the\ninsured and about the contract, which we call input profile here. 
This regression problem is difficult for several reasons: the large number of examples, the large number of variables (most of which are discrete and multi-valued), the non-stationarity of the distribution, and a conditional distribution of the dependent variable which is very different from those usually encountered in typical applications of machine learning and function approximation. This distribution has a mass at zero: the vast majority of the insurance contracts do not yield any claim. This distribution is also strongly asymmetric and it has fat tails (on one side only, corresponding to the large claims).\n\nIn this paper we study and compare several learning algorithms along with methods traditionally used by actuaries for setting insurance premia. The study is performed on a large database of automobile insurance policies. The methods that were tried are the following: the constant (unconditional) predictor as a benchmark, linear regression, generalized linear models (McCullagh and Nelder, 1989), decision tree models (CHAID (Kass, 1980)), support vector machine regression (Vapnik, 1998), multi-layer neural networks, mixtures of neural network experts, and the current premium structure of the insurance company.\n\nIn a variety of practical applications, we often find data distributions with an asymmetric heavy tail extending out towards more positive values. Modeling data with such an asymmetric heavy-tail distribution is inherently difficult because outliers, which are sampled from the tail of the distribution, have a strong influence on parameter estimation. When the distribution is symmetric (around the mean), the problems caused by outliers can be reduced using robust estimation techniques (Huber, 1982; Hampel et al., 1986; Rousseeuw and Leroy, 1987), which essentially ignore or downweight outliers. 
Note that these techniques do not work for an asymmetric distribution: most outliers are on the same side of the mean, so downweighting them introduces a strong bias in the estimation, and the conditional expectation would be systematically underestimated.\n\nThere is another statistical difficulty, due to the large number of variables (mostly discrete) and the fact that many interactions exist between them. Thus the traditional actuarial methods based on tabulating average claim amounts for combinations of values are quickly hurt by the curse of dimensionality, unless they make harmful independence assumptions (Bailey and Simon, 1960). Finally, there is a computational difficulty: we had access to a large database of about 8 x 10^6 examples, and the training effort and numerical stability of some algorithms can be burdensome for such a large number of training examples.\n\nThis paper is organized as follows: we start by describing the mathematical criteria underlying insurance premia estimation (section 2), followed by a brief review of the learning algorithms that we consider in this study, including our best-performing mixture of positive-output neural networks (section 3). We then highlight our most important experimental results (section 4), and in view of them conclude with an examination of the prospects for applying statistical learning algorithms to insurance modeling (section 5).\n\n2 Mathematical Objectives\n\nThe first goal of insurance premia modeling is to estimate the expected claim amount for a given insurance contract for a future one-year period (here we consider that the amount is 0 when no claim is filed). Let X ∈ R^m denote the customer and contract input profile, a vector representing all the information known about the customer and the proposed insurance policy before the beginning of the contract. 
Let A ∈ R+ denote the amount that the customer claims during the contract period; we shall assume that A is non-negative. Our objective is to estimate this claim amount, which is the pure premium P_pure of a given contract x:^1\n\nP_pure(x) = E[A | X = x].    (1)\n\nThe Precision Criterion. In practice, of course, we have no direct access to the quantity (1), which we must estimate. One possible criterion is to seek the most precise estimator, which minimizes the mean-squared error (MSE) over a data set D = {(x_i, a_i)}_{i=1}^L. Let P = {p(·; θ)} be a function class parametrized by the parameter vector θ.\n\n^1 The pure premium is distinguished from the premium actually charged to the customer, which must account for the risk remaining with the insurer, the administrative overhead, desired profit, and other business costs.\n\nThe MSE criterion produces the most precise function (on average) within the class, as measured with respect to D:\n\nθ* = argmin_θ (1/L) Σ_{i=1}^L (p(x_i; θ) - a_i)^2.    (2)\n\nIs it an appropriate criterion, and why? First one should note that if p_1 and p_2 are two estimators of E[A|X], then the MSE criterion is a good indication of how close they are to E[A|X], since by the law of iterated expectations,\n\nE[(p_1(X) - A)^2] - E[(p_2(X) - A)^2] = E[(p_1(X) - E[A|X])^2] - E[(p_2(X) - E[A|X])^2],\n\nand of course the expected MSE is minimized when p(X) = E[A|X].\n\nThe Fairness Criterion. However, in insurance policy pricing, the precision criterion is not the sole part of the picture; just as important is that the estimated premia do not systematically discriminate against specific segments of the population. We call this objective the fairness criterion. 
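As a toy illustration of the MSE criterion (2) (a sketch on synthetic data, not the paper's actual setup), the code below shows that among constant predictors, the empirical mean of the claim amounts minimizes the mean-squared error, consistent with p(X) = E[A|X] being the minimizer. The zero-mass-plus-lognormal claim distribution is an invented stand-in for the data described above.

```python
import numpy as np

# Synthetic claims: a mass at zero plus a fat right tail
# (invented parameters, for illustration only).
rng = np.random.default_rng(0)
a = np.where(rng.random(10_000) < 0.9, 0.0,
             rng.lognormal(mean=7.0, sigma=1.5, size=10_000))

def mse(p, a):
    """Empirical MSE of a constant premium p on claims a."""
    return np.mean((p - a) ** 2)

# Grid search over constant predictors: the minimizer lands
# next to the sample mean of the claim amounts.
candidates = np.linspace(0.0, 2.0 * a.mean(), 201)
best = candidates[np.argmin([mse(p, a) for p in candidates])]
print(best, a.mean())  # best grid point lies close to the sample mean
```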
We define the bias of the premia, b(P), to be the difference between the average premium and the average incurred amount in a given population P:\n\nb(P) = (1/|P|) Σ_{(x_i, a_i) ∈ P} (p(x_i) - a_i),    (3)\n\nwhere |P| denotes the cardinality of the set P, and p(·) is some premia estimation function. A possible fairness criterion would be based on minimizing the norm of the bias over every subpopulation Q of P. From a practical standpoint, such a minimization would be extremely difficult to carry out. Furthermore, the bias over small subpopulations is hard to estimate with statistical significance. We settle instead for an approximation that gives good empirical results. After training a model to minimize the MSE criterion (2), we define a finite number of disjoint subsets (subpopulations) of the test set P, P_k ⊂ P, P_k ∩ P_{j≠k} = ∅, and verify that the absolute bias is not significantly different from zero. The subsets P_k can be chosen at convenience; in our experiments, we considered 10 equal-size subsets delimited by the deciles of the test set premium distribution. In this way, we verify that, for example, for the group of contracts with a premium between the 5th and the 6th decile, the average premium matches the average claim amount.\n\n3 Models Evaluated\n\nAn important requirement for any model of insurance premia is that it should produce positive premia: the company does not want to charge negative money to its customers! To obtain positive outputs from neural networks, we considered using an exponential activation function at the output layer, but this created numerical difficulties (when the argument of the exponential is large, the gradient is huge). Instead, we have successfully used the \"softplus\" activation function (Dugas et al., 2001):\n\nsoftplus(s) = log(1 + e^s),\n\nwhere s is the weighted sum of an output neuron, and softplus(s) is the corresponding predicted premium. 
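A numerically stable implementation of this softplus output activation might look as follows (a sketch; the stable rewriting is a standard trick and this is not the authors' code):

```python
import numpy as np

def softplus(s):
    """softplus(s) = log(1 + e^s), in a numerically stable form:
    for large positive s this returns approximately s instead of
    overflowing, avoiding the huge-gradient problem of a plain
    exponential output unit."""
    s = np.asarray(s, dtype=float)
    # max(s, 0) + log(1 + e^{-|s|}) equals log(1 + e^s) for all s,
    # but never exponentiates a large positive number.
    return np.maximum(s, 0.0) + np.log1p(np.exp(-np.abs(s)))

print(softplus(0.0))      # log(2) ≈ 0.6931
print(softplus(1000.0))   # ≈ 1000, no overflow
print(softplus(-1000.0))  # ≈ 0, output stays non-negative
```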
Note that this function is convex and monotone increasing, and it can be considered as a smooth version of the \"positive part\" function max(0, s).\n\nThe best model that we obtained is a mixture of experts in which the experts are positive-output neural networks. The gater network (Jacobs et al., 1991) has softmax outputs to obtain positive weights summing to one.\n\nFigure 1: A view of the conditional distribution of the claim amounts in the out-of-sample test set. Top: probability density of (claim amount - conditional expectation) for 5 quintiles of the conditional expectation, excluding zero-claim records. The mode moves left for increasing conditional expectation quintiles. Bottom: proportion of non-zero claim records per quintile of the prediction.\n\nThe mixture model was compared to other models. The constant model only has intercepts as free parameters. The linear model corresponds to a ridge linear regression (with weight decay chosen with the validation set). Generalized linear models (GLM) estimate the conditional expectation from f(x) = e^(b + w'x) with parameters b and w. 
Again, weight decay is used and tuned on the validation set. There are many variants of GLMs and they are popular for building insurance models, since they provide positive outputs, interpretable parameters, and can be associated to parametric models of the noise.\n\nDecision trees are also used by practitioners in the insurance industry, in particular the CHAID-type models (Kass, 1980; Biggs, Ville and Suen, 1991), which use statistical criteria for deciding how to split nodes and when to stop growing the tree. We have compared our models with a CHAID implementation based on (Biggs, Ville and Suen, 1991), adapted for regression purposes using a MANOVA analysis. The threshold parameters were selected based on validation set MSE.\n\nFigure 2: MSE results for eight models (training, validation, and test). Models have been sorted in ascending order of test results. The training, validation and test curves have been shifted closer together for visualization purposes (the significant differences in MSE between the 3 sets are due to \"outliers\"). The out-of-sample test performance of the Mixture model is significantly better than that of any of the others. Validation-based model selection is confirmed on test results. CondMean is a constructive greedy version of GLM.\n\nRegression Support Vector Machines (SVM) (Vapnik, 1998) were also evaluated, but yielded disastrous results for two reasons: (1) SVM regression optimizes an L1-like criterion that finds a solution close to the conditional median, whereas the MSE criterion is minimized by the conditional mean, and because the distribution is highly asymmetric the conditional median is far from the conditional mean; (2) because the output variable is difficult to predict, the required number of support vectors is huge, also yielding poor generalization. Since the median is actually 0 for our data, we tried to train the SVM using only the cases with positive claim amounts, and compared the performance to that obtained with the GLM and the neural network. The SVM is still way off the mark because of the above two reasons. Figure 1 (top) illustrates the fat tails and asymmetry of the conditional distribution of the claim amounts.\n\nFinally, we compared the best statistical model with a proprietary table-based and rule-based premium estimation method that was provided to us as the benchmark against which to judge improvements.\n\n4 Experimental Results\n\nData from five kinds of losses were included in the study (i.e., a sub-premium was estimated for each type of loss), but we report mostly aggregated results showing the error on the total estimated premium. The input variables contain information about the policy (e.g., the date, to deal with inflation, deductibles and options), the car, and the driver (e.g., about past claims, past infractions, etc.). Most variables are subject to discretization and binning. Whenever possible, the bins are chosen such that they contain approximately the same number of observations. For most models except CHAID, the discrete variables are one-hot encoded. 
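The one-hot encoding step can be sketched as follows; the variable names and category levels here are invented for illustration (the paper does not list the actual variables):

```python
# Each discrete variable with k levels becomes k 0/1 indicator
# columns; the encoded columns of all variables are concatenated
# into a single input vector.
def one_hot(value, levels):
    """Encode `value` as a list of 0/1 indicators over `levels`."""
    return [1 if value == lev else 0 for lev in levels]

# Hypothetical profile with two discrete variables.
profile = {"region": "urban", "car_use": "commute"}
levels = {"region": ["urban", "suburban", "rural"],
          "car_use": ["commute", "pleasure", "business"]}

x = [bit for var in ("region", "car_use")
         for bit in one_hot(profile[var], levels[var])]
print(x)  # [1, 0, 0, 1, 0, 0]
```

With 39 variables encoded this way, the total number of indicator columns adds up to the input dimension; in the study it comes to 266.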
The number of input random variables is 39, all discrete except one; using one-hot encoding, this results in an input vector x of length m = 266.\n\nTable 1: Statistical comparison of the prediction accuracy difference between several individual learning models and the best Mixture model. The p-value is given under the null hypothesis of no difference between Model #1 and the best Mixture model. Note that all differences are statistically significant.\n\nModel #1      Model #2   Mean MSE Diff.   Std. Error    Z        p-value\nConstant      Mixture    3.40709e-02      3.32724e-03   10.2400  0\nCHAID         Mixture    2.35891e-02      2.57762e-03   9.1515   0\nGLM           Mixture    7.54013e-03      1.15020e-03   6.5555   2.77e-11\nSoftplus NN   Mixture    6.71066e-03      1.09351e-03   6.1368   4.21e-10\nLinear        Mixture    5.82350e-03      1.32211e-03   4.4047   5.30e-06\nNN            Mixture    5.23885e-03      1.41112e-03   3.7125   1.02e-04\n\nTable 2: MSE difference between benchmark and Mixture models across the 5 claim categories (kinds of losses) and the total claim amount. In all cases except category 1, the Mixture model is statistically significantly (p < 0.05) more precise than the benchmark model.\n\nClaim Category (Kind of Loss)   MSE Difference (Benchmark minus Mixture)   95% Confidence Interval (Lower - Higher)\nCategory 1           20669.53   (-4682.83 - 46021.89)\nCategory 2           1305.57    (1032.76 - 1578.37)\nCategory 3           244.34     (6.12 - 482.55)\nCategory 4           1057.51    (623.42 - 1491.60)\nCategory 5           1324.31    (1077.95 - 1570.67)\nTotal claim amount   60187.60   (7743.96 - 112631.24)\n\nAn overall data set containing about 8 million examples is randomly permuted and split into a training set, validation set and test set, respectively of size 50%, 25% and 25% of the total. The validation set is used to select among models (including the choice of capacity), and the test set is used for final statistical comparisons. 
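The random 50%/25%/25% split described above can be sketched like this, with a small synthetic index set standing in for the roughly 8 million records:

```python
import numpy as np

# Sketch of the 50% / 25% / 25% random split: permute the row
# indices once, then slice the permutation into the three sets.
rng = np.random.default_rng(0)
n = 1_000                               # stand-in for ~8e6 rows
perm = rng.permutation(n)               # random permutation of row indices
n_train, n_valid = n // 2, n // 4
train_idx = perm[:n_train]
valid_idx = perm[n_train:n_train + n_valid]
test_idx = perm[n_train + n_valid:]
print(len(train_idx), len(valid_idx), len(test_idx))  # 500 250 250
```

Because the three slices come from one permutation, every record lands in exactly one of the sets.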
Sample-wise paired statistical tests are used to reduce the effect of huge per-sample variability.\n\nFigure 1 is an attempt at capturing the shape of the conditional distribution of claim amounts given input profiles, by considering the distributions of claim amounts in different quantiles of the prediction (pure premium), on the test set. The top figure excludes the point mass of zero claims and rather shows the difference between the claim amount and the estimated conditional expectation (obtained with the mixture model). The bottom histogram shows that the fraction of claims increases nicely for the higher predicted pure premia.\n\nTable 1 and Figure 2 summarize the comparison between the test MSE of the different tested models. NN is a neural network with linear output activation, whereas Softplus NN has the softplus output activation. The Mixture is the mixture of softplus neural networks. This result identifies the mixture model with softplus neural networks as the best-performing of the tested statistical models. Our conjecture is that the mixture model works better because it is more robust to the effect of \"outliers\" (large claims). Classical robust regression methods (Rousseeuw and Leroy, 1987) work by discarding or downweighting outliers: they cannot be applied here because the claims distribution is highly asymmetric (the extreme values are always large ones, the claims being all non-negative). Note that the capacity of each model has been tuned on the validation set. Hence, e.g., CHAID could have easily yielded lower training error, but at the price of worse generalization.\n\nFigure 3: Distribution of the difference between the rule-based benchmark premia and the UdeM Mixture premia (Mean = -1.5993e-10, Median = 37.5455, Stddev = 154.65). The premia difference distribution is negatively skewed, but has a positive median for a mean of zero. This implies that the benchmark model (current pricing) undercharges risky customers, while overcharging typical customers.\n\nTable 2 shows a comparison of this model against the rule-based benchmark. The improvements are shown across the five types of losses. In all cases the mixture improves, and the improvement is significant in four out of the five as well as across the sum of the five.\n\nA qualitative analysis of the resulting predicted premia shows that the mixture model has smoother and more spread-out premia than the benchmark. The analysis (figure 3) also reveals that the difference between the mixture premia and the benchmark premia is negatively skewed, with a positive median, i.e., the typical customer will pay less under the new mixture model, but the \"bad\" (risky) customers will pay much more.\n\nTo evaluate fairness, as discussed in the previous section, the distribution of premia computed by the best model is analyzed, splitting the contracts in 10 groups according to their premium level. Figure 4 shows that the premia charged are fair for each sub-population.\n\n5 Conclusion\n\nThis paper illustrates a successful data-mining application in the insurance industry. It shows that a specialized model (the mixture model), which was designed taking into consideration the specific problems posed by the data (outliers, asymmetric distribution, positive outputs), performs significantly better than existing and popular learning algorithms. 
It also shows that such models can significantly improve over the current practice, making it possible to compute premia that are lower for less risky contracts and higher for more risky contracts, thereby reducing the cost of the median contract.\n\nFuture work should investigate in more detail the role of temporal non-stationarity, how to optimize fairness (rather than just test for it afterwards), and how to further increase the robustness of the model with respect to large claim amounts.\n\nFigure 4: We ensure fairness by comparing the average incurred amount and premia within each decile of the premia distribution (curves: Mixture Model and Rule-Based Model, normalized premia); both models are generally fair to subpopulations. The error bars denote 95% confidence intervals. The comparison is for the sum of claim amounts over all 5 kinds of losses (KOL).\n\nReferences\n\nBailey, R. A. and Simon, L. (1960). Two studies in automobile insurance ratemaking. ASTIN Bulletin, 1(4):192-217.\n\nBiggs, D., Ville, B., and Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. Journal of Applied Statistics, 18(1):49-62.\n\nDugas, C., Bengio, Y., Belisle, F., and Nadeau, C. (2001). Incorporating second order functional knowledge into learning algorithms. In Leen, T., Dietterich, T., and Tresp, V., editors, Advances in Neural Information Processing Systems, volume 13, pages 472-478.\n\nHampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons.\n\nHuber, P. (1982). Robust Statistics. John Wiley & Sons Inc.\n\nJacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture of local experts. Neural Computation, 3:79-87.\n\nKass, G. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2):119-127.\n\nMcCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.\n\nRousseeuw, P. and Leroy, A. (1987). Robust Regression and Outlier Detection. John Wiley & Sons Inc.\n\nVapnik, V. (1998). Statistical Learning Theory. Wiley.\n", "award": [], "sourceid": 2062, "authors": [{"given_name": "Nicolas", "family_name": "Chapados", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Pascal", "family_name": "Vincent", "institution": null}, {"given_name": "Joumana", "family_name": "Ghosn", "institution": null}, {"given_name": "Charles", "family_name": "Dugas", "institution": null}, {"given_name": "Ichiro", "family_name": "Takeuchi", "institution": null}, {"given_name": "Linyan", "family_name": "Meng", "institution": null}]}