{"title": "Diminishing Returns Shape Constraints for Interpretability and Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 6834, "page_last": 6844, "abstract": "We investigate machine learning models that can provide diminishing returns and accelerating returns guarantees to capture prior knowledge or policies about how outputs should depend on inputs. We show that one can build flexible, nonlinear, multi-dimensional models using lattice functions with any combination of concavity/convexity and monotonicity constraints on any subsets of features, and compare to new shape-constrained neural networks. We demonstrate on real-world examples that these shape constrained models can provide tuning-free regularization and improve model understandability.", "full_text": "Diminishing Returns Shape Constraints for\n\nInterpretability and Regularization\n\nMaya R. Gupta, Dara Bahri, Andrew Cotter, Kevin Canini\n\nGoogle AI\n\n1600 Charleston Rd\n\nMountain View, CA 94043\n\n{mayagupta,dbahri,acotter,canini}@google.com\n\nAbstract\n\nWe investigate machine learning models that can provide diminishing returns\nand accelerating returns guarantees to capture prior knowledge or policies\nabout how outputs should depend on inputs. We show that one can build\n\ufb02exible, nonlinear, multi-dimensional models using lattice functions with any\ncombination of concavity/convexity and monotonicity constraints on any\nsubsets of features, and compare to new shape-constrained neural networks.\nWe demonstrate on real-world examples that these shape constrained models\ncan provide tuning-free regularization and improve model understandability.\n\n1 Introduction\n\nDiminishing returns are common in physical systems, human perception and psychology [1],\nand have been recognized in economics [2] and agriculture [3] for centuries. 
For example,\na model that predicts how much a renter will like an apartment should predict a strong\npreference for 60 square meters of living space over 50 square meters, but a smaller preference\nfor 100 square meters over 90 square meters, if everything else is the same. Similarly, a\nmodel that predicts the time it will take a customer to grocery shop should decrease in the\nnumber of cashiers, but each added cashier reduces average wait time by less. In both cases,\nwe would like to be able to incorporate this prior knowledge by constraining the machine\nlearned model\u2019s output to have a diminishing returns response to the size of the apartment\nor number of cashiers. Mathematically, we say a function has diminishing returns with\nrespect to an input if the function is monotonically increasing and concave, or monotonically\ndecreasing and convex, with respect to that input. Accelerating returns are also common in\nthe real-world; for example, Adam Smith characterized labor specialization and economies of\nscale as causing accelerating returns [4]. Accelerating returns describes functions that are\nmonotonically increasing and convex, or monotonically decreasing and concave with respect\nto an input.\nWe show how one can train \ufb02exible models that capture one\u2019s prior knowledge or preference\nthat the model should exhibit diminishing or accelerating returns with respect to some inputs.\nSuch shape constraints are e\ufb00ective regularization, reducing the chance that noisy training\ndata or adversarial examples produce a model that does not behave as expected, and act as a\nmachine learning poka-yoke (mistake-proo\ufb01ng) strategy [5] when a model is re-trained with\nfresh data. Unlike most regularizers, shape constraints do not require tuning the amount of\nregularization (beyond the binary decision of whether to apply a shape constraint), and are\nespecially useful when there is domain shift between training and test distributions. 
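As a concrete numerical illustration of the definition above (a sketch of ours, not code from the paper): a response exhibits diminishing returns on a grid of equally spaced inputs exactly when its first differences are positive and non-increasing.

```python
# Minimal sketch (not from the paper): numerically check that a 1-d response
# curve is monotonically increasing and concave, i.e. exhibits diminishing
# returns, on a grid of equally spaced inputs.
import math

def has_diminishing_returns(f, xs):
    """True if f is increasing with non-increasing first differences on xs."""
    ys = [f(x) for x in xs]
    diffs = [b - a for a, b in zip(ys, ys[1:])]
    increasing = all(d > 0 for d in diffs)
    concave = all(d2 <= d1 + 1e-12 for d1, d2 in zip(diffs, diffs[1:]))
    return increasing and concave

xs = [x / 10 for x in range(1, 101)]
print(has_diminishing_returns(math.sqrt, xs))        # sqrt: increasing, concave
print(has_diminishing_returns(lambda x: x * x, xs))  # x^2: accelerating returns
```

The analogous check for accelerating returns would require positive and non-decreasing first differences.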
Shape constraints have a clear semantic meaning, and thus improve interpretability because the user knows and understands at a high level how each of the shape-constrained inputs affects the output. We have found in practice that shape-constrained machine-learned models are much easier to debug and analyze.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We investigate diminishing returns shape constraints for two flexible, nonlinear function classes: neural networks and lattices. Real-world experiments illustrate how these shape constraints can be useful and effective in practice. However, consistent with the past literature on monotonic neural networks, we found it difficult to control the shape constraints on neural networks as flexibly as with the lattice models. Specifically, for our shape-constrained neural networks (SCNNs) we could not produce a monotonic response for an input without also constraining it to be convex/concave, and we must select either convex or concave constraints; that is, we cannot impose both for different features. In contrast, we show that one can shape-constrain lattice models for any mixture of monotonicity and concavity/convexity constraints on any subsets of the features, and achieve similar (and more stable) test metrics than with unconstrained deep neural networks on a breadth of real-world problems.
Another difference is joint vs. ceteris paribus convexity. Our SCNNs impose joint convexity/concavity over all constrained features, whereas our lattice models' shape constraints are ceteris paribus: a constraint holds with respect to changes in a single feature if none of the other features change, but the convexity/concavity need not hold jointly over all the constrained features. For example, suppose f(x) estimates the success of a party given two features: the number of guests and the number of bottles of wine. 
With the proposed lattice models, one can constrain f(x) to have a ceteris paribus diminishing returns response in the number of guests (for any fixed number of bottles) or in the number of bottles of wine (for any fixed number of guests), but without forcing f(x) to have a diminishing returns response along the diagonal direction defined as one bottle of wine per guest. In contrast, with the proposed SCNN, imposing diminishing returns on both wine and guests produces joint concavity, such that f(x) will be concave over any line in the two-dimensional feature space. We have found it easier to decide when ceteris paribus convexity is warranted for a problem. Joint convexity is a stronger constraint, and we have found it harder to reason about when it is warranted for real problems.

2 Related Work

We review the most related literature, categorized by function class. Note the term partial is used to mean that one can select which of the inputs is shape-constrained. Many of the function classes discussed below and in our proposals are multi-layer functions such that f(x) = h(g(x)). Recall from the chain rule that f'' = h''(g')^2 + g''h'. Thus if h(x) and g(x) are convex and increasing, then f(x) is convex and increasing, and analogously for concave and increasing. Also, if g(x) is convex and h(x) is convex and increasing, that is sufficient for f(x) to be convex.

GAMs: Generalized additive models (GAMs) [6] are a classic function class for imposing shape constraints [7]. Recall GAMs are a sum of component-wise 1-d nonlinear functions such that f(x) = \sum_{d=1}^{D} f_d(x[d]) + b, where x \in R^D, each f_d : R \to R, and b \in R is a constant. Shape constraints are enforced by choosing appropriate parametric forms for each feature's function, f_d, that obey the desired constraints (e.g., for diminishing returns, use a positively-scaled log function). 
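To make the parametric-GAM recipe concrete, here is a minimal sketch (our illustration, not code from any of the cited papers) that fits a single diminishing-returns feature function of the suggested form f_d(x) = a log(1 + x) + b, clipping the least-squares slope a at zero so the fitted component is guaranteed increasing and concave:

```python
# Hedged sketch (not the paper's code): a GAM feature function with a built-in
# diminishing-returns shape, f_d(x) = a*log(1 + x) + b with a >= 0. Clipping
# the least-squares slope at zero is a simple heuristic for the constraint.
import math

def fit_log_component(xs, ys):
    """Least-squares fit of y ~ a*log(1+x) + b, with a clipped to be >= 0."""
    zs = [math.log(1 + x) for x in xs]
    n = len(zs)
    zbar = sum(zs) / n
    ybar = sum(ys) / n
    num = sum((z - zbar) * (y - ybar) for z, y in zip(zs, ys))
    den = sum((z - zbar) ** 2 for z in zs)
    a = max(num / den, 0.0)  # nonnegative scale => increasing and concave
    b = ybar - a * zbar
    return lambda x: a * math.log(1 + x) + b

# Example: noisy data generated from a diminishing-returns curve.
xs = list(range(50))
ys = [2.0 * math.log(1 + x) + 0.1 * ((-1) ** x) for x in xs]
f = fit_log_component(xs, ys)
```

By construction the fitted component is monotonically increasing with shrinking marginal gains, regardless of noise in the training labels.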
Also well-studied are nonparametric shape-constrained GAMs [7]; for example, the special case of D = 1 with monotonicity regularization is well-known as isotonic regression [8]. Recently, Chen et al. [9] gave an algorithm for fitting nonparametric GAM models with diminishing/accelerating returns constraints. Pya and Wood [10] also trained GAMs with first and second derivative shape constraints (see their paper for additional earlier work). These two methods for achieving diminishing returns performed similarly on a real dataset (N = 915 examples, D = 4 features) presented in Chen et al. [9], and both were notably better than unconstrained GAMs.

Max-Affine (Max-Pooling): Convex piecewise-linear models [11, 12] take advantage of the fact that any convex piecewise-linear function can be expressed as a max-affine function [13] (analogously, use min for concavity): f(x) = \max_k \{A_k^T x + b_k\}. Earlier, Sill [14] used max-affine functions followed by a min-pooling layer to form a three-layer min-max network to learn monotonic functions, f(x) = \min_j \max_k \{A_{jk}^T x + b_{jk}\}, with the appropriate components of A constrained to be positive.

Neural Networks: One can constrain a neural network to be monotonic by restricting the network weights to all be positive [15, 16, 17, 18, 19]. However, this strategy significantly reduces expressibility [20]. Specifically, if you constrain a neural net with ReLU activations to be increasing in x by constraining its weights to be non-negative, then an annoying side-effect is that f(x) will also be convex in x. Experimentally, monotonic neural nets have not performed as well as monotonic min-max networks or monotonic deep lattice networks [21].

Recently, Amos et al. [22] produced neural networks with partial convex shape constraints (but without fully enabling monotonicity shape constraints), focused on the goal of learning jointly convex functions that are easy to minimize. 
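The max-affine representation can be sketched directly (an illustration of ours, not code from the cited work): a pointwise max of affine functions is always convex, here with tangent planes to x^2 chosen by hand.

```python
# Illustrative sketch: a max-affine function max_k (a_k . x + b_k) is convex
# piecewise-linear; a min of max-affine functions (Sill's min-max network)
# adds monotonicity when the slope coefficients are constrained nonnegative.
def max_affine(planes, x):
    """planes: list of (a, b) pairs; returns max_k (a_k . x + b_k)."""
    return max(sum(ai * xi for ai, xi in zip(a, x)) + b for a, b in planes)

# Three affine pieces: tangents to f(x) = x^2 at x = 0, 1, 2, so the
# max-affine function underestimates x^2 and matches it at the tangent points.
planes = [([0.0], 0.0), ([2.0], -1.0), ([4.0], -4.0)]
```

Convexity follows because the max of convex (here, affine) functions is convex; adding planes refines the approximation.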
To achieve convexity, they add monotonicity constraints to weights in the later layers of a neural network using standard convex activation functions. For a single layer and using max-pooling as the activation, this reduces to the same function class as convex piecewise-linear fitting [11, 12]. To make their architecture more flexible, they add an unrestricted linear embedding of the inputs into each of the later layers.
Dugas et al. [18] proposed an accelerating returns neural network that requires monotonicity for all inputs, and with partial convexity.

Lattice Models: Recent work in monotonic shape constraints has used multi-layer lattice models [21, 23, 24] (for an open-source implementation see github.com/tensorflow/lattice). Lattices are interpolated look-up tables where the look-up table parameters defining the function are learned with empirical risk minimization [25]; lattice models can also be expressed as multi-dimensional splines with fixed knots [26]. Lattice functions can be constrained to be partially monotonic by constraining adjacent parameters in the underlying look-up table to be monotonic for the shape-constrained inputs [24]. Ensembles of lattices [23] and deep lattice networks (DLN) [21] can similarly be constrained for monotonicity. Experimental results [21] on 4 real datasets showed monotonic DLNs with 3-4 layers performed similarly or better than min-max networks and simpler 2-layer ensembles of lattices, and notably better than monotonic neural networks.

3 Shape-Constrained Neural Network

We extend the partial convex neural network of Amos et al. [22] to enable partial diminishing/accelerating returns constraints. Without loss of generality, flip any decreasing features so that they are increasing before applying the following. 
Partition the feature vector x into three subsets according to the desired shape constraints, x_u, x_c, and x_s, where we constrain the output f(x; \theta) to be convex (concave) with respect to x_c, both convex (concave) and increasing with respect to x_s, and impose no constraints on x_u. Our T-layer SCNN is then f(x; \theta) = z_T, where each layer is defined by the following recurrence (illustrated in Figure 1):

u_{i+1} = h_i\left( W^{(6)}_i u_i + b^{(6)}_i \right) \quad \text{and} \quad u_0 = x_u,

z_{i+1} = g_i\left( W^{(1)}_i \left( z_i \circ \lfloor W^{(2)}_i u_i + b^{(2)}_i \rfloor_+ \right) + W^{(3)}_i \left( x_c \circ \left( W^{(7)}_i u_i + b^{(7)}_i \right) \right) + W^{(4)}_i \left( x_s \circ \lfloor W^{(8)}_i u_i + b^{(8)}_i \rfloor_+ \right) + W^{(5)}_i u_i + b^{(5)}_i \right) \quad \text{and} \quad z_0 = 0, \quad (1)

where W^{(1)}_i \geq 0 and W^{(4)}_i \geq 0, where \circ refers to the Hadamard product and \lfloor x \rfloor_+ = \max(x, 0); h_i can be any activation function but g_i must be an increasing and convex (concave) activation to get convex (concave) shape constraints; like Amos et al. [22] we use ReLU(x) for convexity; for concavity we recommend -ReLU(-x). As is standard, for the last layer there is no activation function, i.e. g_T is the identity function. We augment the architecture proposed in Amos et al. by adding new terms to support x_s and new constraints on g_i and W^{(4)}_i to ensure diminishing/accelerating returns. This architecture works as described by the chain rule and induction. Note that one cannot ask (1) to be convex with respect to some inputs and concave in others. Also, partial monotonicity can be imposed, but it comes with a side effect of convexity/concavity. 
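A toy forward pass of recurrence (1), with scalar states and hand-picked weights satisfying the sign constraints, can illustrate how a convex, increasing response in x_s arises (a sketch under our own simplifications, not the paper's Appendix F code):

```python
# Minimal sketch of the SCNN recurrence (1): scalar states, two layers,
# g = ReLU (convex, increasing), weight names mirroring the W^(k) in the
# text. W1 >= 0 and W4 >= 0 are the sign constraints training must enforce;
# the particular weight values here are illustrative only.
def relu(v):
    return max(v, 0.0)

def scnn_layer(z, u, xc, xs, W1, W2, b2, W3, W7, b7, W4, W8, b8, W5, b5):
    """One step of recurrence (1) for scalar z, u, xc, xs; W1, W4 >= 0."""
    t1 = W1 * (z * relu(W2 * u + b2))        # convex term, requires W1 >= 0
    t2 = W3 * (xc * (W7 * u + b7))           # linear-in-xc term
    t3 = W4 * (xs * relu(W8 * u + b8))       # increasing-in-xs term, W4 >= 0
    return relu(t1 + t2 + t3 + W5 * u + b5)  # g = ReLU

def f(x):
    """Two-layer SCNN response to a single constrained input x_s = x."""
    u0 = 1.0                                  # u_0 = x_u, held fixed here
    z1 = scnn_layer(0.0, u0, 0.0, x, 1.0, 1.0, 0.0,
                    0.0, 0.0, 0.0, 1.0, 1.0, 0.5, 0.0, 0.0)
    u1 = relu(0.5 * u0 + 0.1)                 # h can be any activation
    z2 = scnn_layer(z1, u1, 0.0, x, 1.0, 1.0, 0.0,
                    0.0, 0.0, 0.0, 1.0, 1.0, 0.5, 0.0, 0.0)
    return z2
```

With these weights the composition is piecewise linear, and one can verify numerically that it is convex (midpoint inequality) and nondecreasing in x, as the chain-rule argument predicts.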
Lastly, if more than one feature is constrained to be convex/concave, then (1) exhibits joint convexity/concavity: the function will be convex/concave over any line in the constrained feature space.

Figure 1: SCNN Architecture

4 Lattice Models with Convex/Concave Shape Constraints

We extend partial monotonic lattice ensemble models [23] [24] to also enable ceteris paribus partial convexity or concavity constraints, and thus also partial diminishing/accelerating returns constraints. The proposed lattice models can be individually constrained per-feature for any mix of these shape constraints. Let x[d] be the dth component of the feature vector x, and in this section we require that each feature is bounded and has been pre-scaled per-feature such that x[d] \in [0, 1] for all d.

4.1 Calibrated Linear Models with Convex/Concave Shape Constraints

We start with the simplest lattice model, called a calibrated linear model [24], which is a type of GAM:

f_{\beta,\alpha}(x) = \sum_{d=1}^{D} \alpha_d c_{\beta_d}(x[d]), \quad (2)

where the D calibrated inputs c_{\beta_d}(x[d]) are linearly combined with coefficients \alpha_d \in R, which can be individually constrained to be positive or negative if monotonicity is desired. Each calibrator function c_{\beta_d} : [0, 1] \to [-1, 1] is a piecewise linear function with K - 1 pieces, which we express as the linear interpolation of a one-dimensional look-up table with K key-value pairs (\nu_d[k], \beta_d[k]) for k = 1, ..., K, where the keys \{\nu_d[k]\} are fixed at the centers of the K - 2 uniform quantiles of the training data, and the choice of K is a hyperparameter. See Fig. 3 for example calibrators.

To make f(x) monotonic increasing with respect to the dth input, one must constrain the adjacent look-up table parameters of the dth calibrator c_{\beta_d} to be increasing [24]:

\beta_d[k] \geq \beta_d[k-1] \text{ for } k = 2, 3, ..., K. \quad (3)

To make f(x) concave with respect to the dth input, we must constrain the slopes in the piecewise linear function to be decreasing from the left, which requires that the differences in the calibrator's 1-d K-valued look-up table parameters be decreasing from the left:

\frac{\beta_d[k-1] - \beta_d[k-2]}{\nu_d[k-1] - \nu_d[k-2]} \geq \frac{\beta_d[k] - \beta_d[k-1]}{\nu_d[k] - \nu_d[k-1]} \text{ for } k = 3, 4, ..., K. \quad (4)

We fix the calibrator keys \nu at the quantiles of the training data, so (4) is simply a linear inequality constraint on the three model parameters \beta_d[k], \beta_d[k-1], \beta_d[k-2], for k = K, K-1, ..., 3.

For diminishing returns with respect to the dth input, one constrains the dth calibrator to be both increasing and concave by applying both (3) and (4) during training. Analogous constraints are needed to guarantee f(x) convexity and accelerating returns. The model will guarantee the requested convex/concave response within the input bounds, but for input values outside the input's specified range, the model will clip the input value to the specified range, which makes the function implicitly flat outside the range; thus the accelerating returns guarantee only holds for inputs less than the input's specified maximum, and the diminishing returns guarantee only holds for inputs greater than the input's specified minimum.

Calibrated linear models are GAMs, with the special case of K = N corresponding to a nonparametric model, though in practice we find the validated choice of K is usually 5-50, making calibrated linear models much more efficient to evaluate than non-parametric models. Thus, we expect shape-constrained calibrated linear models to perform similarly to the shape-constrained nonparametric GAMs of Chen et al. 
[9] and Pya and Wood [10].

4.2 Two-Layer Lattice Models with Convex/Concave Shape Constraints

Next, we show how to shape-constrain models with multi-dimensional lattices. While our proposal can be extended to deep lattice networks (see Appendix B), we focus on the two-layer lattice network formed by an ensemble of L calibrated lattices [23]:

f_{\beta,\theta,W}(x) = \sum_{\ell=1}^{L} \theta_\ell^T \phi\left( W_\ell [c_{\beta_{\ell 1}}(x[1]) \; c_{\beta_{\ell 2}}(x[2]) \; ... \; c_{\beta_{\ell D}}(x[D])]^T \right), \quad (5)

with definitions as follows. The \ell-th base model calibrates the dth input using c_{\beta_{\ell d}} : [0, 1] \to [0, 1] as defined in Sec. 4.1, except here we bound its output to [0, 1] so that the 2nd layer inputs lie in the unit hypercube. Next, each linear embedding W_\ell \in R^{S \times D} outputs S values. The function \phi : [0, 1]^S \to [0, 1]^{2^S} is a fixed kernel that transforms its S-dimensional input into the appropriate linear interpolation weights on the 2^S look-up table parameters for the \ell-th lattice: \theta_\ell \in R^{2^S}, for each of the L lattices in the ensemble. The formula for \phi depends on which linear interpolation rule one uses [24]: (i) standard multilinear interpolation produces a multilinear polynomial on z_\ell but is O(2^S), (ii) the Lovász extension (aka simplex interpolation) produces a locally linear interpolation with S! pieces and is O(S log S). We restrict attention to 2^S parameter look-up tables (see Appendix A for details on handling finer-grained look-up tables). 
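The multilinear interpolation kernel \phi can be sketched as follows (an illustrative implementation of standard multilinear interpolation, not the paper's code). Note that the interpolated surface is linear in each coordinate, holding the others fixed, which is the ceteris paribus property exploited in the convexity argument below.

```python
# Sketch (not the paper's code) of standard multilinear interpolation over a
# 2^S-vertex lattice on [0,1]^S: phi maps an input to interpolation weights
# over the vertices, and the model output is the weighted sum of the
# look-up table parameters theta.
from itertools import product

def multilinear_weights(x):
    """phi(x): interpolation weights over the 2^S vertices of [0,1]^S."""
    weights = {}
    for vertex in product([0, 1], repeat=len(x)):
        w = 1.0
        for v, xi in zip(vertex, x):
            w *= xi if v == 1 else (1.0 - xi)
        weights[vertex] = w
    return weights

def lattice_eval(theta, x):
    """Interpolate look-up table theta (dict: vertex -> parameter) at x."""
    return sum(w * theta[v] for v, w in multilinear_weights(x).items())

# 2-d example: look-up table parameters increasing in each coordinate.
theta = {(0, 0): 0.0, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 1.0}
```

For S inputs this enumerates all 2^S vertices, which is why the text notes the O(2^S) cost and why simplex interpolation is attractive for larger S.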
We use the random tiny lattices (RTL) strategy to architect the ensemble of lattices [23], which means that before training we randomly select a fixed subset of S of the D features for the \ell-th lattice and fix the coefficients of W_\ell to be {0, 1} to select the chosen S \leq D features, where S is a hyperparameter.
To produce convex/concave f(x), by linearity it is sufficient to constrain each base model in the ensemble (5) to be convex/concave. If one interpolates the look-up table with the standard multilinear interpolation (by choosing the multilinear interpolation kernel \phi), then the fitted surface forms a multilinear polynomial over [0, 1]^S, and is thus ceteris paribus linear in each feature (but nonlinear overall), and thus the lattice layer does not affect convexity/concavity (note this only holds because W acts as a feature selector; see Appendix B for more on that). Thus, what is needed to guarantee that f(x; \beta, \theta) is convex/concave with respect to the dth input is that each of the calibrators {c_{\beta_{\ell d}}(x[d])} is convex/concave, and each of the L look-up tables is monotonically increasing with respect to x[d]. For the monotonicity constraints, adjacent parameters in the look-up tables need to be constrained to be monotonically increasing, which is done by adding the appropriate linear inequality constraints on pairs of parameters (see [24]). What remains is to constrain the 1-d piecewise linear calibrators {c_{\beta_{\ell d}}(x[d])} to be convex/concave, which requires constraining the calibrator values as per (4). 
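Constraints (3) and (4) on a calibrator's look-up table reduce to simple pairwise inequalities on adjacent parameters; a minimal checker sketch (ours, not the paper's training code):

```python
# Sketch (not the paper's code): check a 1-d calibrator's look-up table
# (keys nu, values beta) against the monotonicity constraint (3) and the
# decreasing-slope concavity constraint (4).
def is_increasing(beta):
    """Constraint (3): adjacent look-up table values are non-decreasing."""
    return all(beta[k] >= beta[k - 1] for k in range(1, len(beta)))

def is_concave(nu, beta):
    """Constraint (4): piecewise-linear slopes decrease from the left."""
    slopes = [(beta[k] - beta[k - 1]) / (nu[k] - nu[k - 1])
              for k in range(1, len(beta))]
    return all(s2 <= s1 + 1e-12 for s1, s2 in zip(slopes, slopes[1:]))

nu = [0.0, 0.25, 0.5, 1.0, 2.0]      # fixed keys (e.g., at data quantiles)
beta = [0.0, 0.4, 0.7, 0.9, 1.0]     # increasing values with shrinking slopes
```

A calibrator passing both checks gives a diminishing-returns response; swapping the direction of either inequality gives the accelerating-returns variants.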
If simplex interpolation is used for \phi instead of multilinear interpolation, constraining for convexity/concavity requires additional constraints on the lattice parameters; see Appendix C for details.

4.3 Training the Constrained Optimization

The lattice models are trained using constrained empirical risk minimization, with the necessary constraints defined above to impose the selected shape constraints. Each of those constraints is a linear inequality constraint on two or three model parameters. For the calibrators, there are not very many shape constraints and we simply use projected SGD, where the Euclidean projection is performed by solving the resulting quadratic program, and we project onto all the calibrator shape constraints after each minibatch of 1000 stochastic gradients.
For the RTL models, to constrain the lattices for monotonicity, we must potentially satisfy a very large number of linear inequality constraints on pairs of adjacent lattice parameters, so solving the QP is less practical. Instead, we use the Light Touch algorithm [27] to stochastically sample the constraints, on top of Adagrad [28].

5 Experiments

We demonstrate the applicability and effectiveness of diminishing returns and accelerating returns regularization on five real-world regression problems (three of the five datasets are publicly available) with squared error loss (see also Appendix D for simulations). We compare: (i) a standard unconstrained DNN, (ii) the partial convex neural network [22], (iii) our shape-constrained neural network as per (1), (iv) calibrated linear models [24], (v) calibrated linear models with the proposed added convexity/concavity constraints, (vi) random tiny lattices (RTLs) [23] which in all cases used monotonic calibrators, (vii) RTLs with the proposed added convexity/concavity constraints.
Our TensorFlow code used for the SCNN is given in Appendix F. 
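For the monotonicity constraints (3) of a single calibrator considered alone, the Euclidean projection used in Sec. 4.3 reduces to isotonic regression, which the pool-adjacent-violators algorithm solves exactly. This sketch (ours; the paper solves the full QP, which also includes (4)) shows that special case:

```python
# Sketch (not the paper's QP solver): Euclidean projection of look-up table
# values onto {v : v[0] <= v[1] <= ...} via pool-adjacent-violators (PAV).
def project_isotonic(y):
    """Return the closest (in L2) non-decreasing sequence to y."""
    blocks = []  # each block is [sum, count], representing a pooled mean
    for v in y:
        blocks.append([float(v), 1])
        # Merge blocks while the means are out of order.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out
```

Each merge replaces an out-of-order pair of blocks by their pooled mean, so the result is non-decreasing and is the exact L2 projection.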
For the lattice models, we used proprietary C++ code to train, as described in Sec. 4.3. Similar results can be achieved by using the open-source TensorFlow Lattice package (github.com/tensorflow/lattice), which already handles monotonicity constraints, and then adding an additive regularizer to penalize violations of (4).
For all model types and each set of constraints, the number of epochs and step size hyperparameters were optimized based on the validation set. All neural net models were run with TensorFlow using the Adam optimizer; see Sec. 4.3 for lattice model optimization. For all neural network models, the number of layers and number of units per layer were also validated from 1-9 and 3-1000 respectively. For all the lattice models, the number of keypoints in each of the calibrators was validated with one hyperparameter for all calibrators, from K = 10, 20, 40, .... For the RTLs, the number of lattices in the ensemble L, and the number of features per lattice S, were validated. Because the number of different validation options across model types was different, there was a risk that the best validated model was simply a particularly good random model. To control for that risk, after the hyperparameters were validated, each model type was freshly re-trained once with the validated hyperparameters, and the test error reported.

5.1 Car Sales

For this tiny 1-d problem with 109 training, 14 validation, and 32 test examples (www.kaggle.com/hsinha53/car-sales/data), we predict monthly car sales (in thousands) from the price (in thousands). Because it is a 1-d problem, the RTL model is the same as a calibrated linear model, and so was not separately run. 
Figure 2 shows the more constrained models are smoother and more interpretable, because a human can summarize the machine-learned model as, "Higher price cars decrease sales, but absolute price differences matter less the higher the price." Table 1 shows the Test MSE is slightly better for the calibrated linear model with the added convexity constraint. The convex SCNN was already decreasing in this case; the extra decreasing monotonicity constraint did not hurt.

Table 1: Experimental Results: Car Sales and Puzzle Sales

Car Sales
Model                  Val. MSE   Test MSE
DNN                    2035       10931
SCNN conv.             2262       10613
SCNN conv. decr.       2442       10590
Cal Lin. decr.         2271       10727
Cal Lin. conv. decr.   2304       10593

Puzzle Sales
Model                  Val. MSE   Test MSE
DNN                    2189       5652
SCNN conc.             2632       7931
SCNN conc. incr.       2437       6927
RTL incr.              4457       8838
RTL all                3543       8315
Cal Lin. incr.         3589       8270
Cal Lin. all           3617       8189

Figure 2: Car sales prediction task. Left: Training dataset. Center: Predictions for neural nets; y-axis zoomed-in. Right: Predictions for calibrated linear models; y-axis zoomed-in.

5.2 Puzzle Sales from Reviews

For this small problem (3 features, 156 training, 169 validation, and 200 non-IID test examples, dataset courtesy of Artifact Puzzles and available at www.kaggle.com/dbahri/puzzles), we predict the 6-month sales of different wooden jigsaw puzzles from three features based on their Amazon reviews: the average star rating, the number of reviews, and the average word count of the reviews. Business experts expect star rating to have a positive effect on sales, the number of reviews to have a diminishing returns effect on sales, and word count to have a diminishing returns shape (the 100th word is not as important as the 10th word). 
The train/val/test data is non-IID in that it is split by time over the past 18 months, and the set of puzzles is the same across the datasets, with some new puzzles added over time. Fig. 3 shows the number-of-reviews calibrators learned for the lattice models. Table 1 shows the Test MSE is slightly better with the added shape constraint for all three models (SCNN, RTL, Cal. Linear). The SCNN concave increasing model is constrained to be concave in both word count and number of reviews and increasing in number of reviews (but does not shape-constrain the star rating feature because we cannot use the SCNN to constrain features positively unless we constrain them to also be concave). The lattice models either imposed just the monotonicity constraints, or all the expected constraints.

5.3 Domain Name Pricing

In this experiment (18 features, 1,522 training, 435 validation, and 217 test examples), we illustrate the use of concavity/convexity constraints without any monotonicity constraints. The goal is to learn a model that can automatically price domain names for Google's .app domain. The label is the percent of humans who rated each example domain name as "premium" vs. "non-premium." Estimates for new domains were then quantized into pricing tiers. One feature is the number of characters in the domain name, and our experts believe that f(x) should be concave and non-monotonic in this feature. The other features measure the popularity of the ngrams in the domain name according to different internet services.

Figure 3: Example calibrator curves learned by the lattice models. 
The diminishing returns calibrators (blue) are easier to interpret and summarize, and appear less over-fit than the monotonic calibrators (red). Left: Calibrated linear model's calibrators for number of reviews for the Puzzles Sales problem. Most of the training examples are in the [1,15] input range, where the curves for the two trained lattice models are very similar. Right: RTL model's calibrators for the relatedness feature for the Query Result Matching problem.

Table 2: Experimental Results: Domain Pricing and Wine Quality

Domain Pricing
Model            Val. MSE   Test MSE
DNN              0.00301    2.00
SCNN conc.       0.00220    0.00219
RTL unc.         0.1014     0.1109
RTL conc.        0.0978     0.1078
Cal Lin. unc.    0.1101     0.1150
Cal Lin. conc.   0.1091     0.1120

Wine Quality
Model                  Val. MSE   Test MSE
DNN                    4.91       4.79
SCNN conc.             5.96       7.22
SCNN conc. incr.       6.13       6.21
RTL incr.              4.96       4.85
RTL conc. incr.        4.96       4.83
Cal Lin. incr.         5.25       5.10
Cal Lin. conc. incr.   5.23       5.10

The results in Table 2 show the extra concavity constraint slightly improves both the validation error and the test error for the RTL and Calibrated Linear models. The DNN's 2.00 test MSE was an unlucky run; we re-trained the DNN with the same validated hyperparameters 100 times, and only saw test MSE that high 6 times. Similarly, the SCNN got very lucky: when we re-trained it 100 times with the same validated hyperparameters, its test MSE was only as low as shown in Table 2 for 5 of the 100 times. For more data on re-training churn for these models, see Appendix E.

5.4 Wine Enthusiast Magazine Reviews

The goal is to predict a wine's quality measured in points [80, 100] based on price (the most important feature), country (21 Bools), and 39 Bool features based on the wine description from Wine Enthusiast Magazine (61 features, 84,642 training, 12,092 validation, and 24,185 test examples; www.kaggle.com/dbahri/wine-ratings). 
Table 2 shows that constraining the price feature does not have much effect on the Test MSE for the RTL and Calibrated Linear models; visual inspection of the learned calibrators (not shown) showed they both picked up the correct shape with or without the shape constraint. The SCNN models had a difficult time fitting this dataset.

5.5 Query-Result Matching

The goal of this problem is to learn how well a candidate result matches a query, for a particular category of queries. The dataset (1,282,532 training, 183,219 validation, 366,440 IID test, with 15 features) is proprietary. The 15 features are derived for each {query, result} example, and the label is an averaged human rating of the match quality, from [0, 4].
We give results in Table 3 using the full data for train/validation, and using only 1/10 of the train/validation data; the test set is the same throughout. We would like to constrain 14 of the D = 15 features to be monotonic, based on prior knowledge and policies about how the features should impact the output. We constrain the most important feature, relatedness, to be concave, based on observing that the shape of its calibrator in an unconstrained calibrated linear model appears to exhibit noisy diminishing returns, as shown in Figure 3. For the SCNN with diminishing returns, only the concave feature is constrained to be monotonic. The biggest effect of the shape constraints is for the calibrated linear models with the smaller training set. 
The RTL test MSE is hurt a little by the shape constraints on the full training set (recall that even the unconstrained RTL model does constrain its calibrators to be monotonic, so all the RTL models get the effect of the monotonic calibrated linear model). We believe this is because there are some parts of the feature space that are very sparse, and satisfying the monotonicity constraints everywhere reduces the RTL's ability to use all its flexibility to do better on average by better fitting the denser parts of the feature space. In practice the test distribution is not IID with the train distribution, and we consider it a worthwhile trade-off to have the model constrained to behave sensibly throughout the potential feature space to protect against embarrassing errors and improve debuggability.
As in our other experiments, the lattice models were more stable across re-trainings than the neural nets: the test MSE standard deviation with the 1/10 train set and each model's validated hyperparameters over five re-trainings was .002 for the RTL models, .084 for the SCNN concave, .047 for the SCNN dim. ret., and 1.69 for the DNN. Surprisingly, the dim. ret. SCNN actually does worse than the concave SCNN, even though the constrained feature is definitely a strongly positive signal, but we believe the SCNN MSE differences merely reflect the randomness in training and hyperparameter choices.

Table 3: Experimental Results: Query Result Matching

                   1/10 Train Data         Full Train Data
Model              Val. MSE   Test MSE     Val. MSE   Test MSE
DNN                0.668      0.668        0.655      0.656
SCNN conc.         0.667      0.673        0.652      0.658
SCNN dim. ret.     0.673      0.680        0.659      0.667
RTL unc.           0.661      0.658        0.639      0.639
RTL mono           0.663      0.662        0.655      0.654
RTL all            0.663      0.661        0.655      0.654
Cal. Lin. unc.     0.718      0.756        0.715      0.743
Cal. Lin. mono     0.699      0.710        0.701      0.701
Cal. Lin. all      0.696      0.702        0.724      0.722

6 Conclusions

We specified the additional constraints needed to learn flexible lattice models that can impose any mixture of convexity/concavity/monotonicity constraints over subsets of features, ceteris paribus, and showed we can stably train models with these extra linear inequality constraints. The additional shape constraints produce smoother models that are easier to summarize, explain and debug, because their behavior is more predictable and is known to satisfy the specified global properties. Experimental results on real-world problems showed the extra convexity/concavity shape constraints either reduced or did not affect test MSE on IID test sets, and provided the most value for the non-IID Puzzles experiment, where shape constraint regularization is expected to be most valuable. We also extended neural networks to handle partial diminishing returns constraints. The resulting SCNNs enable a less flexible menu of shape constraint choices, and their experimental results were more sensitive to hyperparameter choices and stochasticity in mini-batch sampling.

References

[1] D. Kahneman and A. Tversky. Choices, Values and Frames. Cambridge University Press, 2000.

[2] D. Ricardo. On the Principles of Political Economy and Taxation. 1817.

[3] F. L. Patton. Diminishing Returns in Agriculture. 1926.

[4] A. Smith. An Inquiry into the Nature and Causes of the Wealth of Nations. 1776.

[5] S. Shingo. Zero quality control: source inspection and the poka-yoke system. Productivity Press, Portland, USA, 1986.

[6] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman Hall, New York, 1990.

[7] P. Groeneboom and G. Jongbloed. 
Nonparametric estimation under shape constraints. Cambridge University Press, New York, USA, 2014.

[8] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical inference under order restrictions; the theory and application of isotonic regression. Wiley, New York, USA, 1972.

[9] Y. Chen and R. J. Samworth. Generalized additive and index models with shape constraints. Journal of the Royal Statistical Society B, 2016.

[10] N. Pya and S. N. Wood. Shape constrained additive models. Statistics and Computing, 2015.

[11] J. Kim, J. Lee, L. Vandenberghe, and C.-K. Yang. Techniques for improving the accuracy of geometric-programming based analog circuit design optimization. Proc. IEEE International Conference on Computer-aided Design, 2004.

[12] A. Magnani and S. P. Boyd. Convex piecewise-linear fitting. Optimization and Engineering, 2009.

[13] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.

[14] J. Sill. Monotonic networks. Advances in Neural Information Processing Systems (NIPS), 1998.

[15] N. P. Archer and S. Wang. Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Decision Sciences, 24(1):60–75, 1993.

[16] H. Kay and L. H. Ungar. Estimating monotonic functions and their bounds. AIChE Journal, 46(12):2426–2434, 2000.

[17] A. Minin, M. Velikova, B. Lang, and H. Daniels. Comparison of universal approximators incorporating partial monotonicity by structure. Neural Networks, 23(4):471–475, 2010.

[18] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating functional knowledge in neural networks. Journal of Machine Learning Research, 2009.

[19] Y.-J. Qu and B.-G. Hu. Generalized constraint neural network regression model subject to linear priors. IEEE Trans. on Neural Networks, 22(11):2447–2459, 2011.

[20] H. Daniels and M. Velikova.
Monotone and partially monotone neural networks. IEEE Trans. Neural Networks, 21(6):906–917, 2010.

[21] S. You, D. Ding, K. Canini, J. Pfeifer, and M. R. Gupta. Deep lattice networks and partial monotonic functions. Advances in Neural Information Processing Systems (NIPS), 2017.

[22] B. Amos, L. Xu, and J. Z. Kolter. Input convex neural networks. Proc. ICML, 2017.

[23] K. Canini, A. Cotter, M. M. Fard, M. R. Gupta, and J. Pfeifer. Fast and flexible monotonic functions with ensembles of lattices. Advances in Neural Information Processing Systems (NIPS), 2016.

[24] M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research, 17(109):1–47, 2016.

[25] E. K. Garcia and M. R. Gupta. Lattice regression. In Advances in Neural Information Processing Systems (NIPS), 2009.

[26] E. K. Garcia, R. Arora, and M. R. Gupta. Optimized regression for efficient function evaluation. IEEE Trans. Image Processing, 21(9):4128–4140, Sept. 2012.

[27] A. Cotter, M. R. Gupta, and J. Pfeifer. A light touch for heavily constrained SGD. In 29th Annual Conference on Learning Theory, pages 729–771, 2016.

[28] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[29] L. Lovász. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pages 235–257. Springer, 1983.

[30] M. Milani Fard, Q. Cormier, K. Canini, and M. R. Gupta. Launch and iterate: Reducing prediction churn.
Advances in Neural Information Processing Systems (NIPS), 2016.