{"title": "Fast and Flexible Monotonic Functions with Ensembles of Lattices", "book": "Advances in Neural Information Processing Systems", "page_first": 2919, "page_last": 2927, "abstract": "For many machine learning problems, there are some inputs that are known to be positively (or negatively) related to the output, and in such cases training the model to respect that monotonic relationship can provide regularization, and makes the model more interpretable. However, flexible monotonic functions are computationally challenging to learn beyond a few features. We break through this barrier by learning ensembles of monotonic calibrated interpolated look-up tables (lattices). A key contribution is an automated algorithm for selecting feature subsets for the ensemble base models. We demonstrate that compared to random forests, these ensembles produce similar or better accuracy, while providing guaranteed monotonicity consistent with prior knowledge, smaller model size and faster evaluation.", "full_text": "Fast and Flexible Monotonic Functions with\n\nEnsembles of Lattices\n\nK. Canini, A. Cotter, M. R. Gupta, M. Milani Fard, J. Pfeifer\n\nGoogle Inc.\n\n1600 Amphitheatre Parkway, Mountain View, CA 94043\n\n{canini,acotter,mayagupta,janpf,mmilanifard}@google.com\n\nAbstract\n\nFor many machine learning problems, there are some inputs that are known to\nbe positively (or negatively) related to the output, and in such cases training the\nmodel to respect that monotonic relationship can provide regularization, and makes\nthe model more interpretable. However, \ufb02exible monotonic functions are com-\nputationally challenging to learn beyond a few features. We break through this\nbarrier by learning ensembles of monotonic calibrated interpolated look-up ta-\nbles (lattices). A key contribution is an automated algorithm for selecting feature\nsubsets for the ensemble base models. 
We demonstrate that compared to random\nforests, these ensembles produce similar or better accuracy, while providing guar-\nanteed monotonicity consistent with prior knowledge, smaller model size and faster\nevaluation.\n\n1\n\nIntroduction\n\nA long-standing challenge in machine learning is to learn \ufb02exible monotonic functions [1] for\nclassi\ufb01cation, regression, and ranking problems. For example, all other features held constant, one\nwould expect the prediction of a house\u2019s cost to be an increasing function of the size of the property.\nA regression trained on noisy examples and many features might not respect this simple monotonic\nrelationship everywhere, due to over\ufb01tting. Failing to capture such simple relationships is confusing\nfor users. Guaranteeing monotonicity enables users to trust that the model will behave reasonably\nand predictably in all cases, and enables them to understand at a high-level how the model responds\nto monotonic inputs [2]. Prior knowledge about monotonic relationships can also be an effective\nregularizer [3].\n\nFigure 1: Contour plots for an ensemble of three lattices, each of which is a linearly-interpolated\nlook-up table. The \ufb01rst lattice f1 acts on features 1 and 2, the second lattice f2 acts on features 2\nand 3, and f3 acts on features 3 and 4. Each 2\u00d72 look-up table has four parameters: for example,\nfor f1(x) the parameters are \u03b81 = [0, 0.9, 0.8, 1]. Each look-up table parameter is the function value\nfor an extreme input, for example, f1([0, 1]) = \u03b81[2] = 0.9. 
The ensemble function f1 + f2 + f3 is monotonic with respect to features 1, 2 and 3, but not feature 4.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Table 1: Key notation used in the paper.

D: number of features
𝒟: set of features 1, 2, . . . , D
S: number of features in each lattice
L: number of lattices in the ensemble
s_ℓ ⊂ 𝒟: ℓth lattice's set of S feature indices
(x_i, y_i) ∈ [0,1]^D × ℝ: ith training example
x[s_ℓ] ∈ [0,1]^S: x restricted to the feature subset s_ℓ
Φ(x) : [0,1]^D → [0,1]^{2^D}: linear interpolation weights for x
v ∈ {0,1}^D: a vertex of a 2^D lattice
θ ∈ ℝ^{2^D}: parameter values for a 2^D lattice

Simple monotonic functions can be learned with a linear function forced to have positive coefficients, but learning flexible monotonic functions is challenging. For example, multi-dimensional isotonic regression has complexity O(n⁴) for n training examples [4]. Other prior work learned monotonic functions on small datasets and very low-dimensional problems; see [2] for a survey. The largest datasets used in recent papers on training monotonic neural nets included two features and 3434 examples [3], one feature and 30 examples [5], and six features and 174 examples [6]. Another approach, which uses an ensemble of rules on a modified training set, scales poorly with the dataset size and does not provide monotonicity guarantees on test samples [7].
Recently, monotonic lattice regression was proposed [2], which extended lattice regression [8] to learn a monotonic interpolated look-up table by adding linear inequality constraints that force adjacent look-up table parameters to be monotonic. See Figure 1 for an illustration of three such lattice functions.
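To make the interpolation concrete, here is a minimal sketch of evaluating Figure 1's first lattice with multilinear interpolation. The vertex-indexing convention (last feature as the least-significant bit of the parameter index) is an assumption for illustration, chosen so the caption's example f1([0, 1]) = 0.9 holds.

```python
from itertools import product

def multilinear_interp(theta, x):
    """Evaluate a 2^S-parameter lattice at x in [0,1]^S.

    theta[k] is the function value at the vertex whose coordinates are the
    binary expansion of k (assumed: last feature = least-significant bit).
    """
    S = len(x)
    value = 0.0
    for k, v in enumerate(product((0, 1), repeat=S)):
        # Multilinear weight: product over dims of x[d] if v[d]==1 else (1-x[d]).
        w = 1.0
        for d in range(S):
            w *= x[d] if v[d] else (1.0 - x[d])
        value += w * theta[k]
    return value

# Figure 1's first lattice: theta1 = [f(0,0), f(0,1), f(1,0), f(1,1)].
theta1 = [0.0, 0.9, 0.8, 1.0]
print(multilinear_interp(theta1, [0, 1]))      # 0.9, the vertex value from the caption
print(multilinear_interp(theta1, [0.5, 0.5]))  # 0.675, the average of the four corners
```

At a vertex the weights collapse to a single one-hot entry, so the lattice reproduces its parameters exactly; in the interior it blends the four corners bilinearly.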
Experiments on real-world problems showed similar accuracy to random forests [9], which generally perform well [10], but with the benefit of guaranteed monotonicity. Monotonic functions on up to sixteen features were demonstrated, but that approach is fundamentally limited in its ability to scale to higher input dimensions, as the number of parameters for a lattice model scales as O(2^D).
In this paper, we break through previous barriers on D by proposing strategies to learn a monotonic ensemble of lattices. The main contributions of this paper are: (i) proposing different architectures for ensembles of lattices with different trade-offs for flexibility, regularization and speed, (ii) a theorem showing lattices can be merged, (iii) an algorithm to automatically learn feature subsets for the base models, (iv) extensive experimental analysis on real data showing results that are similar to or better than random forests, while respecting prior knowledge about monotonic features.

2 Ensemble of Lattices

Consider the usual supervised machine learning set-up of training sample pairs {(x_i, y_i)} for i = 1, . . . , n. The label is either a real-valued y_i ∈ ℝ, or a binary classification label y_i ∈ {−1, 1}. We assume that sensible upper and lower bounds can be set for each feature, and without loss of generality, the feature is then scaled so that x_i ∈ [0,1]^D. Key notation is summarized in Table 1. We propose learning a weighted ensemble of L lattices:

F(x) = α_0 + Σ_{ℓ=1}^{L} α_ℓ f(x[s_ℓ]; θ_ℓ),   (1)

where each lattice f(x[s_ℓ]; θ_ℓ) is a lattice function defined on a subset of features s_ℓ ⊂ 𝒟, and x[s_ℓ] denotes the S × 1 vector with the components of x corresponding to the feature set s_ℓ. We require each lattice to have S features, i.e., |s_ℓ| = S and θ_ℓ ∈ ℝ^{2^S} for all ℓ. The ℓth lattice is a linearly interpolated look-up table defined on the vertices of the S-dimensional unit hypercube:

f(x[s_ℓ]; θ_ℓ) = θ_ℓ^T Φ(x[s_ℓ]),

where θ_ℓ ∈ ℝ^{2^S} are the look-up table parameters, and Φ(x[s]) : [0,1]^S → [0,1]^{2^S} are the linear interpolation weights for x in the ℓth lattice. See Appendix A for a review of multilinear and simplex interpolations.

2.1 Monotonic Ensemble of Lattices

Define a function f to be monotonic with respect to feature d if f(x) ≥ f(z) for any two feature vectors x, z ∈ ℝ^D that satisfy x[d] ≥ z[d] and x[m] = z[m] for m ≠ d. A lattice function is monotonic with respect to a feature if the look-up table parameters are nondecreasing as that feature increases, and a lattice can be constrained to be monotonic with an appropriate set of sparse linear inequality constraints [2].
To guarantee that an ensemble of lattices is monotonic with respect to feature d, it is sufficient that each lattice be monotonic in feature d and that we have positive weights α_ℓ ≥ 0 for all ℓ, by linearity of the sum over lattices in (1). Constraining each base model to be monotonic provides only a subset of the possible monotonic ensemble functions, as there could exist monotonic functions of the form (1) that are not monotonic in each f_ℓ(x).
In this paper, for notational simplicity and because we find it the most empirically useful case for machine learning, we only consider lattices that have two vertices along each dimension, so a look-up table on S features has 2^S parameters. Generalization to larger look-up tables is trivial [11, 2].

2.2 Ensemble of Lattices Can Be Merged into One Lattice

We show that an ensemble of lattices as expressed by (1) is equivalent to a single lattice defined on all D features.
This result is important for two practical reasons. First, it shows that training an ensemble over subsets of features is a regularized form of training one lattice with all D features. Second, this result can be used to merge small lattices that share many features into larger lattices on the superset of their features, which can be useful to reduce evaluation time and memory use of the ensemble model.

Theorem 1. Let F(x) := Σ_{ℓ=1}^{L} α_ℓ θ_ℓ^T Φ(x[s_ℓ]) as described in (1), where Φ are either multilinear or simplex interpolation weights. Then there exists a θ ∈ ℝ^{2^D} such that F(x) = θ^T Φ(x) for all x ∈ [0,1]^D.

The proof and an illustration are given in the supplementary material.

3 Feature Subset Selection for the Base Models

A key problem is which features should be combined in each lattice. The random subspace method [12] and many variants of random forests randomly sample subsets of features for the base models. Sampling the features to increase the likelihood of selecting combinations of features that better correlate with the label can produce better random forests [13]. Ye et al. first estimated the informativeness of each feature by fitting a linear model and dividing the features up into two groups based on the linear coefficients, then randomly sampled the features considered for each tree from both groups, to ensure strongly informative features occur in each tree [14]. Others take the random subset of features for each tree as a starting point, but then create more diverse trees.
For example, one can increase the diversity of the features in a tree ensemble by changing the splitting criterion [15], or by maximizing the entropy of the joint distribution over the leaves [16].
We also consider randomly selecting the subset of S features for each lattice uniformly and independently without replacement from the set of D features; we refer to this as a random tiny lattice (RTL). To improve the accuracy of ensemble selection, we can draw K independent RTL ensembles using K different random seeds, train each ensemble independently, and select the one with the lowest training, validation, or cross-validation error. This approach treats the random seed that generates the RTL as a hyperparameter to optimize, and in the extreme of sufficiently large K, this strategy can be used to minimize the empirical risk. The computation scales linearly in K, but the K training jobs can be parallelized.

3.1 Crystals: An Algorithm for Selecting Feature Subsets

As a more efficient alternative to randomly selecting feature subsets, we propose an algorithm we term Crystals to jointly optimize the selection of the L feature subsets. Note that linear interactions between features can be captured by the ensemble's linear combination of base models, but nonlinear interactions must be captured by the features occurring together in a base model. This motivates us to propose choosing the feature subsets for each base model based on the importance of pairwise nonlinear interactions between the features.
To measure pairwise feature interactions, we first separately (and in parallel) train lattices on all possible pairs of features, that is, L = D(D−1)/2 lattices, each with S = 2 features. Then, we measure the nonlinearity of the interaction of any two features d and d̃ by the torsion of their lattice, which is the squared difference between the slopes of the lattice's parallel edges [2].
Let θ_{d,d̃} denote the parameters of the 2×2 lattice for features d and d̃. The torsion of the pair (d, d̃) is defined as:

τ_{d,d̃} := ((θ_{d,d̃}[1] − θ_{d,d̃}[0]) − (θ_{d,d̃}[3] − θ_{d,d̃}[2]))².   (2)

If the lattice trained on features d and d̃ is just a linear function, its torsion is τ_{d,d̃} = 0, whereas a large torsion value implies a nonlinear interaction between the features. Given the torsions of all pairs of features, we propose using the L feature subsets {s_ℓ} that maximize the weighted total pairwise torsion of the ensemble:

H({s_ℓ}) := Σ_{ℓ=1}^{L} Σ_{d,d̃ ∈ s_ℓ, d ≠ d̃} τ_{d,d̃} · γ^{Σ_{ℓ′=1}^{ℓ−1} I(d,d̃ ∈ s_{ℓ′})},   (3)

where I is an indicator. The discount value γ is a hyperparameter that controls how much the value of a pair of features diminishes when repeated in multiple lattices. In the extreme case γ = 1, the objective (3) is maximized by the degenerate solution in which all L lattices contain the same highest-torsion feature pairs. The other extreme of γ = 0 only counts the first inclusion of a pair of features in the ensemble towards the objective, and results in unnecessarily diverse lattices. We found a default of γ = 1/2 generally produces good, diverse lattices, but γ can also be optimized as a hyperparameter.
In order to select the subsets {s_ℓ}, we first choose the number of times each feature is going to be used in the ensemble. We make sure each feature is used at least once and then assign the rest of the feature counts proportional to the median of each feature's torsion with other features.
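To make the selection criterion concrete, the torsion (2) and the discounted objective (3) can be sketched as follows. The parameter ordering matches the earlier 2×2 convention, and the pair torsions in the toy example are made-up illustrative values, not trained ones:

```python
import itertools

def torsion(theta):
    """Torsion (2) of a 2x2 lattice theta = [f(0,0), f(0,1), f(1,0), f(1,1)]:
    the squared difference between the slopes of parallel edges."""
    return ((theta[1] - theta[0]) - (theta[3] - theta[2])) ** 2

def ensemble_torsion_objective(subsets, tau, gamma=0.5):
    """Objective (3): total pairwise torsion, discounting repeated pairs.

    tau[(d, dd)] is the torsion of the pair lattice for features d < dd;
    gamma is the repetition discount (paper default 1/2).
    """
    seen = {}   # how many earlier lattices already contained each pair
    total = 0.0
    for s in subsets:
        for d, dd in itertools.combinations(sorted(s), 2):
            total += tau[(d, dd)] * gamma ** seen.get((d, dd), 0)
            seen[(d, dd)] = seen.get((d, dd), 0) + 1
    return total

# Toy example with 4 features; illustrative torsion values.
tau = {(0, 1): 2.0, (0, 2): 0.1, (0, 3): 0.0,
       (1, 2): 1.0, (1, 3): 0.3, (2, 3): 0.0}
print(ensemble_torsion_objective([{0, 1}, {0, 1}], tau))  # 2.0 + 0.5*2.0 = 3.0
print(ensemble_torsion_objective([{0, 1}, {1, 2}], tau))  # 2.0 + 1.0 = 3.0
```

The two prints show the discount at work: repeating the best pair earns only γ times its torsion the second time, so a diverse ensemble can score just as well.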
We initialize a random ensemble that satisfies the selected feature counts and then try to maximize the objective in (3) using a greedy swapping method: we loop over all pairs of lattices, and swap any features between them that increase H({s_ℓ}), until no swaps improve the objective. This optimization takes a small fraction of the total training time for the ensemble. One can potentially improve the objective by using a stochastic annealing procedure, but we find our deterministic method already yields good solutions in practice.

4 Calibrating the Features

Accuracy can be increased by calibrating each feature with a one-dimensional monotonic piecewise linear function (PLF) before it is combined with other features [17, 18, 2]. These calibration functions can approximate log, sigmoidal, and other useful feature pre-processing transformations, and can be trained as part of the model.
For an ensemble, we consider two options: either a set of D calibrators shared across the base models (one PLF per feature), or L sets of calibrators, one set of S calibrators for each base model, for a total of LS calibrators. Use of separate calibrators provides more flexibility, but increases the potential for overfitting, increases evaluation time, and removes the ability to merge lattices.
Let c(x[s_ℓ]; ν_ℓ) : [0,1]^S → [0,1]^S denote the vector-valued calibration function on the feature subset s_ℓ with calibration parameters ν_ℓ. The ensemble function will thus be:

Separate Calibration: F(x) = α_0 + Σ_{ℓ=1}^{L} α_ℓ f(c(x[s_ℓ]; ν_ℓ); θ_ℓ),   (4)

Shared Calibration: F(x) = α_0 + Σ_{ℓ=1}^{L} α_ℓ f(c(x[s_ℓ]; ν[s_ℓ]); θ_ℓ),   (5)

where ν is the set of all shared calibration parameters and ν[s_ℓ] is the subset corresponding to the feature set s_ℓ.
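A monotonic PLF calibrator of the kind described above can be sketched as follows. The keypoint values are illustrative assumptions; in the paper the keypoint outputs are learned under monotonicity constraints rather than fixed by hand:

```python
import bisect

def plf(x, knots_x, knots_y):
    """Piecewise linear calibration function through (knots_x, knots_y),
    clamped outside the keypoint range.

    The PLF is monotonic exactly when knots_y is nondecreasing, which is
    the condition the paper enforces via linear inequality constraints.
    """
    if x <= knots_x[0]:
        return knots_y[0]
    if x >= knots_x[-1]:
        return knots_y[-1]
    i = bisect.bisect_right(knots_x, x) - 1   # segment containing x
    t = (x - knots_x[i]) / (knots_x[i + 1] - knots_x[i])
    return (1 - t) * knots_y[i] + t * knots_y[i + 1]

# A log-like monotonic calibrator on [0, 1] (illustrative keypoints).
kx, ky = [0.0, 0.1, 0.5, 1.0], [0.0, 0.5, 0.8, 1.0]
print(plf(0.05, kx, ky))  # 0.25, halfway along the first steep segment
print(plf(0.3, kx, ky))   # 0.65, halfway along the second segment
```

The binary search over keypoints is the same evaluation strategy the paper mentions when discussing calibrator evaluation cost.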
Note that Theorem 1 holds for shared calibrators, but does not hold for separate\ncalibrators.\nWe implement these PLF\u2019s as in [2]. The PLF\u2019s are monotonic if the adjacent parameters in each\nPLF are monotonic, which can be enforced with additional linear inequality constraints [2]. By\ncomposition, if the calibration functions for monotonic features are monotonic, and the lattices are\nmonotonic with respect to those features, then the ensemble with positive weights on the lattices is\nmonotonic.\n\n5 Training the Lattice Ensemble\n\nThe key question in training the ensemble is whether to train the base models independently, as is\nusually done in random forests, or to train the base models jointly, akin to generalized linear models.\n\n5.1\n\nJoint Training of the Lattices\n\nIn this setting, we optimize (1) jointly over all calibration and lattice parameters, in which case each\n\u03b1(cid:96) is subsumed by the corresponding base model parameters \u03b8(cid:96) and can therefore be ignored. 
Joint training allows the base models to specialize, and increases the flexibility, but can slow down training and is more prone to overfitting for the same choice of S.

5.2 Parallelized, Independent Training of the Lattices

We can train the lattices independently in parallel, which is much faster, and then fit the weights α_ℓ in (1) in a second post-fitting stage as described in Step 5 below.
Step 1: Initialize the calibration function parameters ν and each tiny lattice's parameters θ_ℓ.
Step 2: Train the L lattices in parallel; for the ℓth lattice, solve the monotonic lattice regression problem [2]:

arg min_{θ_ℓ} Σ_{i=1}^{n} 𝓛(f(c(x_i[s_ℓ]; ν_ℓ); θ_ℓ), y_i) + λR(θ_ℓ), such that Aθ_ℓ ≤ 0,   (6)

where Aθ_ℓ ≤ 0 captures the linear inequality constraints needed to enforce monotonicity for whichever features are required, 𝓛 is a convex loss function, and R denotes a regularizer on the lattice parameters.
Step 3: If separate calibrators are used as per (4), their parameters can be optimized jointly with the lattice parameters in (6). If shared calibrators are used, we hold all lattice parameters fixed and optimize the shared calibration parameters ν:

arg min_{ν} Σ_{i=1}^{n} 𝓛(F(x_i; θ, ν, α), y_i), such that Bν ≤ 0,

where F is as defined in (5), and Bν ≤ 0 specifies the linear inequality constraints needed to enforce monotonicity of each of the piecewise linear calibration functions.
Step 4: Loop on Steps 2 and 3 until convergence.
Step 5: Post-fit the weights α ∈ ℝ^L over the ensemble:

arg min_{α} Σ_{i=1}^{n} 𝓛(F(x_i; θ, ν, α), y_i) + γR̃(α), such that α_ℓ ≥ 0 for all ℓ,   (7)

where R̃(α) is a regularizer on α.
For example, regularizing α with the ℓ1 norm encourages a sparser ensemble, which can be useful for improving evaluation speed. Regularizing α with a ridge regularizer makes the post-fit more similar to averaging the base models, reducing variance.

6 Fast Evaluation

The proposed lattice ensembles are fast to evaluate. The evaluation complexity of simplex interpolation of a lattice ensemble with L lattices each of size S is O(LS log S), but in practice one encounters fixed costs, and caching efficiency that depends on the model size. For example, for Experiment 3 with D = 50 features, with C++ implementations of both Crystals and random forests both evaluating the base models sequentially, the random forest takes roughly 10× as long to evaluate, and takes roughly 10× as much memory as the Crystals. See Appendix C.5 for further discussion and more timing results.

Table 2: Details for the datasets used in the experiments.

Dataset  Problem         Features  Monotonic  Train    Validation  Test 1   Test 2
1        Classification  12        4          29 307   9769        9769     -
2        Regression      12        10         115 977  -           31 980   -
3        Classification  54        9          500 000  100 000     200 000  -
4        Classification  29        21         88 715   11 071      11 150   65 372

Evaluating calibration functions can add notably to the overall evaluation time. If evaluating the average calibration function takes time c, with shared calibrators the eval time is O(cD + LS log S) because the D calibrators can be evaluated just once, but with separate calibrators, the eval time is generally worse at O(LS(c + log S)). For practical problems, evaluating separate calibrators may be one-third of the total evaluation time, even when implemented with an efficient binary search.
However, if evaluation can be efficiently parallelized with multi-threading, then there is little difference in evaluation time between shared and separate calibrators.
If shared (or no) calibrators are used, merging the ensemble's lattices into fewer larger lattices (see Sec. 2.2) can greatly reduce evaluation time. For example, for a problem with D = 14 features, we found the accuracy was best if we trained an ensemble of 300 lattices with S = 9 features each. The resulting ensemble had 300 × 2⁹ = 153 600 parameters. We then merged all 300 lattices into one equivalent 2¹⁴ lattice with only 16 384 parameters, which reduced both the memory and the evaluation time by a factor of 10.

7 Experiments

We demonstrate the proposals on four datasets. Dataset 1 is the ADULT dataset from the UCI Machine Learning Repository [19], and the other datasets are provided by product groups from Google, with monotonicity constraints for certain features given by the corresponding product group. See Table 2 for details. To efficiently handle the large number of linear inequality constraints (~100 000 constraints for some of these problems) when training a lattice or ensemble of lattices, we used LightTouch [20].
We compared to random forests (RF) [9], an ensemble method that consistently performs well for datasets this size [10]. However, RF makes no attempt to respect the monotonicity constraints, nor is it easy to check if an RF is monotonic. We used a C++ package implementation for RF.
All the hyperparameters were optimized on validation sets. Please see the supplemental for further experimental details.

7.1 Experiment 1 - Monotonicity as a Regularizer

In the first experiment, we compare the accuracy of different models in predicting whether income is greater than $50k for the ADULT dataset (Dataset 1).
We compare four models on this dataset: (I) random forest (RF), (II) single unconstrained lattice, (III) single lattice constrained to be monotonic in 4 features, (IV) an ensemble of 50 lattices with 5 features in each lattice and separate calibrators for each lattice, jointly trained. For the constrained models, we set the function to be monotonically increasing in capital-gain, weekly hours of work, and education level, and, per the reported gender wage gap [21], in gender.
Results over 5 runs of each algorithm, shown in Table 3, demonstrate how monotonicity can act as a regularizer to improve the testing accuracy: the monotonic models have lower training accuracy, but higher test accuracy. The ensemble of lattices also improves accuracy over the single lattice; we hypothesize this is because the small lattices provide useful regularization, while the separate calibrators and ensemble provide helpful flexibility. See the appendix for a more detailed analysis of the results.

Table 3: Accuracy of different models on the ADULT dataset from UCI.

Model                  Training Accuracy  Testing Accuracy
Random Forest          90.56 ± 0.03       85.21 ± 0.01
Unconstrained Lattice  86.34 ± 0.00       84.96 ± 0.00
Monotonic Lattice      86.29 ± 0.02       85.36 ± 0.03
Monotonic Crystals     86.25 ± 0.02       85.53 ± 0.04

7.2 Experiment 2 - Crystals vs. Random Search for Feature Subset Selection

The second experiment on Dataset 2 is a regression to score the quality of a candidate for a matching problem on a scale of [0, 4]. The training set consists of 115 977 past examples, and the testing set consists of 31 980 more recent examples (thus the samples are not IID). There are D = 12 features, of which 10 are constrained to be monotonic on all the models compared for this experiment.
We use this problem with its small number of features to illustrate the effect of the feature subset choice, comparing to RTLs optimized over K = 10 000 different trained ensembles, each with different random feature subsets.
All ensembles were restricted to S = 2 features per base model, so there were only 66 distinct feature subsets possible, and thus for an L = 8 lattice ensemble, there were (66 choose 8) ≈ 5.7×10⁹ possible feature subsets. Ten-fold cross-validation was used to select an RTL out of 1, 10, 100, 1000, or 10 000 RTLs whose feature subsets were randomized with different random seeds. We compared to a calibrated linear model and a calibrated single lattice on all D = 12 features. See the supplemental for further experimental details.
Figure 2 shows the normalized mean squared error (MSE divided by label variance, which is 0 for the oracle model, and 1 for the best constant model). Results show that Crystals (orange line) is substantially better than a random draw of the feature subsets (light blue line), and for mid-sized ensembles (e.g. L = 32 lattices), Crystals can provide very large computational savings (1000×) over the RTL strategy of randomly considering different feature subsets.

7.3 Experiment 3 - Larger-Scale Classification: Crystals vs. Random Forest

The third experiment on Dataset 3 is to classify whether a candidate result is a good match to a user. There are D = 54 features, of which 9 were constrained to be monotonic. We split 800k labelled samples based on time, using the 500k oldest samples for a training set, the next 100k samples for a validation set, and the most recent 200k samples for a testing set (so the three datasets are not IID).
Results over 5 runs of each algorithm in Figure 3 show that the Crystals ensemble is about 0.25%-0.30% more accurate on the testing set over a broad range of ensemble sizes.
The best RF on the validation set used 350 trees with a leaf size of 1, and the best Crystals model used 350 lattices with 6 features per lattice. Because the RF hyperparameter validation chose to use a minimum leaf size of 1, the size of the RF model for this dataset scaled linearly with the dataset size, to about 1GB. The large model size severely affects memory caching, and combined with the deep trees in the RF ensemble, makes both training and evaluation an order of magnitude slower than for Crystals.

Figure 2: Comparison of normalized mean squared test error on Dataset 2. Average standard error is less than 10⁻⁴.

Figure 3: Test accuracy on Dataset 3 over the number of base models (trees or lattices). Error bars are standard errors.

Table 4: Results on Dataset 4. Test Set 1 has the same distribution as the training set. Test Set 2 is a more realistic test of the task. Left: Crystals vs. random forests. Right: Comparison of different optimization algorithms.

Left:
Model          Test Set 1 Accuracy  Test Set 2 Accuracy
Random Forest  75.23 ± 0.06         90.51 ± 0.01
Crystals       75.18 ± 0.05         91.15 ± 0.05

Right:
Lattices Training  Calibrators           Test Set 1 Accuracy
Joint              Separate Per Lattice  74.78 ± 0.04
Independent        Separate Per Lattice  72.80 ± 0.04
Joint              Shared                74.48 ± 0.04

7.4 Experiment 4 - Comparison of Optimization Algorithms

The fourth experiment on Dataset 4 is to classify whether a specific visual element should be shown on a webpage. There are D = 29 features, of which 21 were constrained to be monotonic.
The Train\nSet, Validation Set, and Test Set 1 were randomly split from one set of examples, whose sampling\ndistribution was skewed to sample mostly dif\ufb01cult examples. In contrast, Test Set 2 was uniformly\nrandomly sampled, with samples from a larger set of countries, making it a more accurate measure of\nthe expected accuracy in practice.\nAfter optimizing the hyperparameters on the validation set, we independently trained 10 RF and 10\nCrystal models, and report the mean test accuracy and standard error on the two test sets in Table\n4. On Test Set 1, which was split off from the same set of dif\ufb01cult examples as the Train Set and\nValidation Set, the random forests and Crystals perform statistically similar. On Test Set 2, which\nis six times larger and was sampled uniformly and from a broader set of countries, the Crystals\nare statistically signi\ufb01cantly better. We believe this demonstrates that the imposed monotonicity\nconstraints effectively act as regularizers and help the lattice ensemble generalize better to parts of\nthe feature space that were sparser in the training data.\nIn a second experiment with Dataset 4, we used RTLs to illustrate the effects of shared vs. separate\ncalibrators, and training the lattices jointly vs. independently.\nWe \ufb01rst constructed an RTL model with 500 lattices of 5 features each and a separate set of calibrators\nfor each lattice, and trained the lattices jointly as a single model. We then separately modi\ufb01ed this\nmodel in two different ways: (1) training the lattices independently of each other and then learning an\noptimal linear combination of their predictions, and (2) using a single, shared set of calibrators for all\nlattices. All models were trained using logistic loss, mini-batch size of 100, and 200 loops. 
For each model, we chose the optimization algorithms' step sizes by finding the power of 2 that maximized accuracy on the validation set.
Table 4 (right) shows that joint training and separate calibrators for the different lattices can provide a notable and statistically significant increase in accuracy, due to the greater flexibility.

8 Conclusions

The use of machine learning has become increasingly popular in practice. That has come with a greater demand for machine learning that matches the intuitions of domain experts. Complex models, even when highly accurate, may not be accepted by users who worry the model may not generalize well to new samples.
Monotonicity guarantees can provide an important sense of control and understanding over learned functions. In this paper, we showed how ensembles can be used to learn the largest and most complicated monotonic functions to date. We proposed a measure of pairwise feature interactions that can identify good feature subsets in a fraction of the computation needed for random feature selection. On real-world problems, we showed these monotonic ensembles provide similar or better accuracy, and faster evaluation time compared to random forests, which do not provide monotonicity guarantees.

References

[1] Y. S. Abu-Mostafa. A method for learning from hints. In Advances in Neural Information Processing Systems, pages 73-80, 1993.
[2] M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research, 17(109):1-47, 2016.
[3] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating functional knowledge in neural networks. Journal of Machine Learning Research, 2009.
[4] J. Spouge, H. Wan, and W. J. Wilbur.
Least squares isotonic regression in two dimensions.\n\nJournal of Optimization Theory and Applications, 117(3):585\u2013605, 2003.\n\n[5] Y.-J. Qu and B.-G. Hu. Generalized constraint neural network regression model subject to linear\n\npriors. IEEE Trans. on Neural Networks, 22(11):2447\u20132459, 2011.\n\n[6] H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Trans.\n\nNeural Networks, 21(6):906\u2013917, 2010.\n\n[7] W. Kotlowski and R. Slowinski. Rule learning with monotonicity constraints. In Proceedings of\nthe 26th Annual International Conference on Machine Learning, pages 537\u2013544. ACM, 2009.\n\n[8] E. K. Garcia and M. R. Gupta. Lattice regression. In Advances in Neural Information Processing\n\nSystems (NIPS), 2009.\n\n[9] L. Breiman. Random forests. Machine Learning, 45(1):5\u201332, 2001.\n\n[10] M. Fernandez-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of\nclassi\ufb01ers to solve real world classi\ufb01cation problems? Journal Machine Learning Research,\n2014.\n\n[11] E. K. Garcia, R. Arora, and M. R. Gupta. Optimized regression for ef\ufb01cient function evaluation.\n\nIEEE Trans. Image Processing, 21(9):4128\u20134140, Sept. 2012.\n\n[12] T. Ho. Random subspace method for constructing decision forests. IEEE Trans. on Pattern\n\nAnalysis and Machine Intelligence, 20(8):832\u2013844, 1998.\n\n[13] D. Amaratunga, J. Cabrera, and Y. S. Lee. Enriched random forests. Bioinformatics, 24(18),\n\n2008.\n\n[14] Y. Ye, Q. Wu, J. Z. Huang, M. K. Ng, and X. Li. Strati\ufb01ed sampling for feature subspace\n\nselection in random forests for high-dimensional data. Pattern Recognition, 2013.\n\n[15] L. Zhang and P. N. Suganthan. Random forests with ensemble of feature spaces. Pattern\n\nRecognition, 2014.\n\n[16] H. Xu, G. Chen, D. Povey, and S. Khudanpur. Modeling phonetic context with non-random\n\nforests for speech recognition. Proc. Interspeech, 2015.\n\n[17] G. Sharma and R. Bala. 
Digital Color Imaging Handbook. CRC Press, New York, 2002.\n\n[18] A. Howard and T. Jebara. Learning monotonic transformations for classi\ufb01cation. In Advances\n\nin Neural Information Processing Systems, 2007.\n\n[19] C. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.\n\n[20] A. Cotter, M. R. Gupta, and J. Pfeifer. A Light Touch for heavily constrained SGD. In 29th\n\nAnnual Conference on Learning Theory, pages 729\u2013771, 2016.\n\n[21] D. Weichselbaumer and R. Winter-Ebmer. A meta-analysis of the international gender wage\n\ngap. Journal of Economic Surveys, 19(3):479\u2013511, 2005.\n\n9\n\n\f", "award": [], "sourceid": 1464, "authors": [{"given_name": "Mahdi", "family_name": "Milani Fard", "institution": "Google"}, {"given_name": "Kevin", "family_name": "Canini", "institution": "Google"}, {"given_name": "Andrew", "family_name": "Cotter", "institution": "Google"}, {"given_name": "Jan", "family_name": "Pfeifer", "institution": "Google"}, {"given_name": "Maya", "family_name": "Gupta", "institution": "Google"}]}