{"title": "Deep Lattice Networks and Partial Monotonic Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 2981, "page_last": 2989, "abstract": "We propose learning deep models that are monotonic with respect to a user-specified set of inputs by alternating layers of linear embeddings, ensembles of lattices, and calibrators (piecewise linear functions), with appropriate constraints for monotonicity, and jointly training the resulting network. We implement the layers and projections with new computational graph nodes in TensorFlow and use the Adam optimizer and batched stochastic gradients. Experiments on benchmark and real-world datasets show that six-layer monotonic deep lattice networks achieve state-of-the-art performance for classification and regression with monotonicity guarantees.", "full_text": "Deep Lattice Networks and Partial Monotonic Functions\n\nSeungil You, David Ding, Kevin Canini, Jan Pfeifer, Maya R. Gupta\n\nGoogle Research\n\n1600 Amphitheatre Parkway, Mountain View, CA 94043\n\n{siyou,dwding,canini,janpf,mayagupta}@google.com\n\nAbstract\n\nWe propose learning deep models that are monotonic with respect to a user-specified set of inputs by alternating layers of linear embeddings, ensembles of lattices, and calibrators (piecewise linear functions), with appropriate constraints for monotonicity, and jointly training the resulting network. We implement the layers and projections with new computational graph nodes in TensorFlow and use the Adam optimizer and batched stochastic gradients. Experiments on benchmark and real-world datasets show that six-layer monotonic deep lattice networks achieve state-of-the-art performance for classification and regression with monotonicity guarantees.\n\n1 Introduction\n\nWe propose building models with multiple layers of lattices, which we refer to as deep lattice networks (DLNs). 
While we hypothesize that DLNs may generally be useful, we focus on the challenge of\nlearning \ufb02exible partially-monotonic functions, that is, models that are guaranteed to be monotonic\nwith respect to a user-speci\ufb01ed subset of the inputs. For example, if one is predicting whether to\ngive someone else a loan, we expect and would like to constrain the prediction to be monotonically\nincreasing with respect to the applicant\u2019s income, if all other features are unchanged. Imposing\nmonotonicity acts as a regularizer, improves generalization to test data, and makes the end-to-end\nmodel more interpretable, debuggable, and trustworthy.\nTo learn more \ufb02exible partial monotonic functions, we propose architectures that alternate three\nkinds of layers: linear embeddings, calibrators, and ensembles of lattices, each of which is trained\ndiscriminatively to optimize a structural risk objective and obey any given monotonicity constraints.\nSee Fig. 2 for an example DLN with nine such layers.\nLattices are interpolated look-up tables, as shown in Fig. 1. Lattices have been shown to be an\nef\ufb01cient nonlinear function class that can be constrained to be monotonic by adding appropriate\nsparse linear inequalities on the parameters [1], and can be trained in a standard empirical risk\nminimization framework [2, 1]. Recent work showed lattices could be jointly trained as an ensemble\nto learn \ufb02exible monotonic functions for an arbitrary number of inputs [3].\nCalibrators are one-dimensional lattices, which nonlinearly transform a single input [1]; see Fig. 1 for\nan example. They have been used to pre-process inputs in two-layer models: calibrators-then-linear\nmodels [4], calibrators-then-lattice models [1], and calibrators-then-ensemble-of-lattices model [3].\nHere, we extend their use to discriminatively normalize between other layers of the deep model, as\nwell as act as a pre-processing layer. 
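A calibrator of this kind reduces to fixed-keypoint piecewise-linear interpolation. Below is a minimal NumPy sketch (our own illustration, not the paper's TensorFlow implementation; the names `calibrator`, `a`, and `b` are ours). Out-of-range inputs are clamped to the endpoint values, which matches the clipping behavior described later in the paper and is what `np.interp` does implicitly:

```python
import numpy as np

def calibrator(x, keypoint_inputs, keypoint_outputs):
    """Piecewise-linear calibrator: a 1-d lattice.

    keypoint_inputs are fixed (e.g. uniformly spaced); keypoint_outputs are the
    trainable look-up table values. Inputs outside the keypoint range are
    clamped to the edge values by np.interp.
    """
    return np.interp(x, keypoint_inputs, keypoint_outputs)

# Five uniformly-spaced keypoints on [-10, 10], as in Fig. 1 (left).
a = np.linspace(-10.0, 10.0, 5)
b = np.array([0.0, 0.1, 0.5, 0.6, 1.0])  # trained outputs; monotonic here

print(calibrator(0.0, a, b))    # hits the middle keypoint exactly -> 0.5
print(calibrator(-25.0, a, b))  # clipped to the left edge -> 0.0
```

With monotonically non-decreasing `b`, the calibrator is a monotonic function of its input, which is exactly the constraint imposed during training.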
We also find that using a calibrator for the last layer can help nonlinearly transform the outputs to better match the labels.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Left: Example calibrator (1-d lattice) with fixed input range [-10, 10] and five fixed uniformly-spaced keypoints and corresponding discriminatively-trained outputs (look-up table values). Middle: Example lattice on three inputs with fixed input range [0, 1]^3, with 8 discriminatively-trained parameters (shown as gray values), each corresponding to one of the 2^3 vertices of the unit hypercube. The parameters are linearly interpolated for any input in [0, 1]^3 to form the lattice function's output. If the parameters are increasing in any direction, then the function is monotonic increasing in that direction. In this example, the gray-value parameters get lighter in all three directions, so the function is monotonic increasing in all three inputs. Right: Three examples of lattice values are shown in italics, each interpolated from the 8 lattice parameters.\n\nFigure 2: Illustration of a nine-layer DLN: calibrators, linear embedding, calibrators, ensemble of lattices, calibrators, ensemble of lattices, calibrators, lattice, calibrator.\n\nWe first describe the proposed DLN layers in detail in Section 2. In Section 3, we review more related work in learning flexible partial monotonic functions. We provide theoretical results characterizing the flexibility of the DLN in Section 4, followed by details on our open-source TensorFlow implementation and numerical optimization choices in Section 5. 
Experimental results demonstrate the potential on benchmark and real-world scenarios in Section 6.\n\n2 Deep Lattice Network Layers\n\nWe describe in detail the three types of layers we propose for learning flexible functions that can be constrained to be monotonic with respect to any subset of the inputs. Without loss of generality, we assume monotonic means monotonic non-decreasing (one can flip the sign of an input if non-increasing monotonicity is desired). Let x_t ∈ R^{D_t} be the input vector to the tth layer, with D_t inputs, and let x_t[d] denote the dth input for d = 1, . . . , D_t. Table 1 summarizes the parameters and hyperparameters for each layer. For notational simplicity, in some places we drop the subscript t if it is clear from context. We also denote as x^m_t the subset of x_t that is to be monotonically constrained, and as x^n_t the subset of x_t that is non-monotonic.\n\nLinear Embedding Layer: Each linear embedding layer consists of two linear matrices: one matrix W^m_t ∈ R^{D^m_{t+1} × D^m_t} that linearly embeds the monotonic inputs x^m_t, a separate matrix W^n_t ∈ R^{(D_{t+1} - D^m_{t+1}) × (D_t - D^m_t)} that linearly embeds the non-monotonic inputs x^n_t, and one bias vector b_t. To preserve monotonicity on the embedded vector W^m_t x^m_t, we impose the linear inequality constraints\n\nW^m_t[i, j] ≥ 0 for all (i, j).\n\nThe output of the linear embedding layer is:\n\nx_{t+1} = [x^m_{t+1}; x^n_{t+1}] = [W^m_t x^m_t; W^n_t x^n_t] + b_t,    (1)\n\nwhere [u; v] stacks u on top of v. Only the first D^m_{t+1} coordinates of x_{t+1} need to be monotonic inputs to the (t+1)th layer. These two linear embedding matrices and the bias vector are discriminatively trained.\n\nCalibration Layer: Each calibration layer consists of a separate one-dimensional piecewise linear transform for each input at that layer, c_{t,d}(x_t[d]), that maps R to [0, 1], so that\n\nx_{t+1} := [c_{t,1}(x_t[1]) c_{t,2}(x_t[2]) · · · c_{t,D_t}(x_t[D_t])]^T.\n\nHere each c_{t,d} is a 1D lattice with K key-value pairs (a ∈ R^K, b ∈ R^K), and the function for each input is linearly interpolated between the two b values corresponding to the input's surrounding a values. An example is shown on the left in Fig. 1.\n\nEach 1D calibration function is equivalent to a sum of weighted-and-shifted rectified linear units (ReLUs); that is, a calibrator function c(x[d]; a, b) can be equivalently expressed as\n\nc(x[d]; a, b) = Σ_{k=1}^{K} α[k] ReLU(x - a[k]) + b[1],    (2)\n\nwhere α[k] := (b[k+1] - b[k])/(a[k+1] - a[k]) - (b[k] - b[k-1])/(a[k] - a[k-1]) for k = 2, . . . , K - 1; α[1] := (b[2] - b[1])/(a[2] - a[1]); and α[K] := -(b[K] - b[K-1])/(a[K] - a[K-1]).\n\nHowever, enforcing monotonicity and boundedness constraints for the calibrator output is much simpler with the (a, b) parameterization of each keypoint's input-output values, as we discuss shortly.\n\nBefore training the DLN, we fix the input range for each calibrator to [a_min, a_max], and we fix the K keypoints a ∈ R^K to be uniformly spaced over [a_min, a_max]. Inputs that fall outside [a_min, a_max] are clipped to that range. The calibrator output parameters b ∈ [0, 1]^K are discriminatively trained.\n\nFor monotonic inputs, we can constrain the calibrator functions to be monotonic by constraining the calibrator parameters b ∈ [0, 1]^K to be monotonic, by adding the linear inequality constraints\n\nb[k] ≤ b[k + 1] for k = 1, . . .
, K - 1    (3)\n\ninto the training objective [3]. We also experimented with constraining all calibrators to be monotonic (even for non-monotonic inputs) for more stable/regularized training.\n\nEnsemble of Lattices Layer: Each ensemble of lattices layer consists of G lattices. Each lattice is a linearly interpolated multidimensional look-up table; for an example, see the middle and right pictures in Fig. 1. Each S-dimensional look-up table takes inputs over the S-dimensional unit hypercube [0, 1]^S, and has 2^S parameters θ ∈ R^{2^S}, specifying the lattice's output for each of the 2^S vertices of the unit hypercube. Inputs in-between the vertices are linearly interpolated, which forms a smooth but nonlinear function over the unit hypercube. Two interpolation methods have been used: multilinear interpolation and simplex interpolation [1] (also known as the Lovász extension [5]). We use multilinear interpolation for all our experiments, which can be expressed as ψ(x)^T θ, where the nonlinear feature transformation ψ(x) : [0, 1]^S → [0, 1]^{2^S} gives the 2^S linear interpolation weights that input x puts on each of the 2^S parameters θ, such that the interpolated value for x is ψ(x)^T θ, and\n\nψ(x)[j] = Π_{d=1}^{S} x[d]^{v_j[d]} (1 - x[d])^{1 - v_j[d]},\n\nwhere v_j is the coordinate vector of the jth vertex of the unit hypercube, with v_j[d] ∈ {0, 1}, and j = 1, . . . , 2^S. For example, when S = 2, v_1 = (0, 0), v_2 = (0, 1), v_3 = (1, 0), v_4 = (1, 1) and ψ(x) = ((1 - x[1])(1 - x[2]), (1 - x[1])x[2], x[1](1 - x[2]), x[1]x[2]).\n\nThe ensemble of lattices layer produces G outputs, one per lattice. 
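The multilinear interpolation weights can be computed directly from the formula above. Below is a minimal NumPy sketch (our own illustration; the names `psi` and `lattice_value` are ours), using the same vertex ordering as the S = 2 example in the text:

```python
import itertools
import numpy as np

def psi(x):
    """Multilinear interpolation weights for an S-dimensional lattice.

    Vertices are enumerated (0,...,0), (0,...,1), ..., (1,...,1), matching the
    S = 2 example in the text: (0,0), (0,1), (1,0), (1,1).
    psi(x)[j] = prod_d x[d]^{v_j[d]} * (1 - x[d])^{1 - v_j[d]}.
    """
    S = len(x)
    weights = []
    for v in itertools.product((0, 1), repeat=S):
        w = 1.0
        for d in range(S):
            w *= x[d] if v[d] == 1 else (1.0 - x[d])
        weights.append(w)
    return np.array(weights)

def lattice_value(x, theta):
    # Interpolated output psi(x)^T theta; theta has 2^S entries.
    return psi(x) @ theta

x = [0.2, 0.4]
print(psi(x))  # ((1-.2)(1-.4), (1-.2)(.4), .2(1-.4), .2*.4) = [0.48 0.32 0.12 0.08]

# Parameters increasing in both coordinate directions -> monotonic lattice.
theta = np.array([0.0, 0.3, 0.6, 1.0])
print(lattice_value(x, theta))  # 0.248
```

The weights always sum to 1, so the lattice output is a convex combination of the parameter values at the vertices, as the figure's interpolated examples illustrate.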
When initializing the DLN, if the (t+1)th layer is an ensemble of lattices, we randomly permute the outputs of the previous layer to be assigned to the G_{t+1} × S_{t+1} inputs of the ensemble. If a lattice has at least one monotonic input, then that lattice's output is constrained to be a monotonic input to the next layer to guarantee end-to-end monotonicity. Each lattice is constrained to be monotonic by enforcing monotonicity constraints on each pair of lattice parameters that are adjacent in the monotonic directions; for details see Gupta et al. [1].\n\nTable 1: DLN layers and hyperparameters\n\nLayer t | Parameters | Hyperparameters\nLinear Embedding | W^m_t ∈ R^{D^m_{t+1} × D^m_t}, W^n_t ∈ R^{(D_{t+1} - D^m_{t+1}) × (D_t - D^m_t)}, b_t ∈ R^{D_{t+1}} | D_{t+1}\nCalibrators | B_t ∈ R^{D_t × K} | K ∈ N+ keypoints, input range [ℓ, u]\nLattice Ensemble | θ_{t,g} ∈ R^{2^{S_t}} for g = 1, . . . , G_t | G_t lattices, S_t inputs per lattice\n\nEnd-to-end monotonicity: The DLN is constructed to preserve end-to-end monotonicity with respect to a user-specified subset of the inputs. As we described, the parameters for each component (matrix, calibrator, lattice) can be constrained to be monotonic with respect to a subset of inputs by satisfying certain linear inequality constraints [1]. Also, if a component has a monotonic input, then the output of that component is treated as a monotonic input to the following layer. Because the composition of monotonic functions is monotonic, the constructed DLN belongs to the partial monotonic function class. The arrows in Figure 2 illustrate this construction, i.e., how the tth layer output becomes a monotonic input to the (t+1)th layer.\n\n2.1 Hyperparameters\n\nWe detail the hyperparameters for each type of DLN layer in Table 1. 
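The composition argument for end-to-end monotonicity can be sanity-checked numerically. Below is a small sketch under simplified assumptions (our own illustration, not the paper's TensorFlow implementation): a non-negative linear embedding followed by monotonic piecewise-linear calibrators, with a sum standing in for downstream layers. Each stage preserves monotonicity, so the stack does too:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monotonic linear embedding: element-wise non-negative weights (W^m >= 0).
W = np.abs(rng.normal(size=(3, 2)))
b = rng.normal(size=3)

# Monotonic calibrator: non-decreasing look-up table values over fixed keypoints.
keypoints = np.linspace(0.0, 1.0, 5)
values = np.sort(rng.uniform(size=5))

def layer_stack(x):
    # Linear embedding, then calibrate each coordinate; the sum stands in
    # for downstream layers. Each stage is monotonic in x[0].
    h = W @ x + b
    return sum(np.interp(hi, keypoints, values) for hi in h)

# Sweeping the monotonic input upward never decreases the output.
xs = np.linspace(0.0, 1.0, 50)
outs = [layer_stack(np.array([x, 0.3])) for x in xs]
assert all(later >= earlier for earlier, later in zip(outs, outs[1:]))
```

The same check applies to any stack of constrained DLN layers: as long as every component is monotonic in its monotonic inputs and those outputs are fed forward as monotonic inputs, the composed function is monotonic.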
Some of these hyperparameters constrain each other, since the number of outputs from each layer must equal the number of inputs to the next layer; for example, if a linear embedding layer has D_{t+1} = 1000 outputs, then there are 1000 inputs to the next layer, and if that next layer is a lattice ensemble, its hyperparameters must obey G_{t+1} × S_{t+1} = 1000.\n\n3 Related Work\n\nLow-dimensional monotonic models have a long history in statistics, where they are called shape constraints, and often use isotonic regression [6]. Learning monotonic single-layer neural nets by constraining the neural net weights to be positive dates back to Archer and Wang in 1993 [7], and that basic idea has been re-visited by others [8, 9, 10, 11], but with some negative results about the obtainable flexibility, even with multiple hidden layers [12]. Sill [13] proposed a three-layer monotonic network that used a monotonic linear embedding and max-and-min-pooling. Daniels and Velikova [12] extended Sill's result to learn a partial monotonic function by combining min-max-pooling, also known as adaptive logic networks [14], with a partial monotonic linear embedding, and showed that their proposed architecture is a universal approximator for partial monotone functions. None of these prior neural networks were demonstrated on problems with more than D = 10 features, nor trained on more than a few thousand examples. For our experiments we implemented a positive neural network and a min-max-pooling network [12] with TensorFlow.\n\nThis paper extends recent work in learning multidimensional flexible partial monotonic two-layer networks consisting of a layer of calibrators followed by an ensemble of lattices [3], with parameters appropriately constrained for monotonicity, which built on earlier work of Gupta et al. [1]. This work differs in three key regards.\n\nFirst, we alternate layers to form a deeper, and hence potentially more flexible, network. 
Second, a key question addressed in Canini et al. [3] is how to decide which features should be put together in each lattice in their ensemble. They found that random assignment worked well, but required large ensembles. They showed that smaller (and hence faster) models with the same accuracy could be trained by using a heuristic pre-processing step they proposed (crystals) to identify which features interact nonlinearly. This pre-processing step requires training a lattice for each pair of inputs to judge that pair's strength of interaction, which scales as O(D^2), and we found it can be a large fraction of overall training time for D > 50.\n\nWe solve this problem of determining which inputs should interact in each lattice by using a linear embedding layer before an ensemble of lattices layer to discriminatively and adaptively learn during training how to map the features to the first ensemble-layer lattices' inputs. This strategy also means each input to a lattice can be a linear combination of the features. This use of a jointly trained linear embedding is the second key difference from that prior work [3].\n\nThe third difference is that in previous work [4, 1, 3], the calibrator keypoint values were fixed a priori based on the quantiles of the features, which is challenging to do for the calibration layers mid-DLN, because the quantiles of their inputs evolve during training. 
Instead, we fix the keypoint values uniformly over the bounded calibrator domain.\n\n4 Function Class of Deep Lattice Networks\n\nWe offer some results and hypotheses about the function class of deep lattice networks, depending on whether the lattices are interpolated with multilinear interpolation (which forms multilinear polynomials) or simplex interpolation (which forms locally linear surfaces).\n\n4.1 Cascaded multilinear lookup tables\n\nWe show that a deep lattice network made up only of cascaded layers of lattices (without intervening layers of calibrators or linear embeddings) is equivalent to a single lattice defined on the D input features if multilinear interpolation is used. It is easy to construct counter-examples showing that this result does not hold for simplex-interpolated lattices.\n\nLemma 1. Suppose that a lattice has L inputs that can each be expressed in the form θ_i^T ψ(x[s_i]), where the s_i are mutually disjoint and ψ represents multilinear interpolation weights. Then the output can be expressed in the form θ̂^T ψ̂(x[∪ s_i]). That is, the lattice preserves the functional form of its inputs, changing only the values of the coefficients θ and the linear interpolation weights ψ.\n\nProof. Each input i of the lattice can be expressed in the following form:\n\nf_i = θ_i^T ψ(x[s_i]) = Σ_{k=1}^{2^{|s_i|}} θ_i[v_{ik}] Π_{d ∈ s_i} x[d]^{v_{ik}[d]} (1 - x[d])^{1 - v_{ik}[d]}\n\nThis is a multilinear polynomial on x[s_i]. The output can be expressed in the following form:\n\nF = Σ_{j=1}^{2^L} θ[v_j] Π_{i=1}^{L} f_i^{v_j[i]} (1 - f_i)^{1 - v_j[i]}\n\nNote the product in the expression: f_i and 1 - f_i are both multilinear polynomials, but within each term of the product, only one is present, since one of the two has exponent 0 and the other has exponent 1. 
Furthermore, since each f_i is a function of a different subset of x, we conclude that the entire product is a multilinear polynomial. Since the sum of multilinear polynomials is still a multilinear polynomial, we conclude that F is a multilinear polynomial. Any multilinear polynomial on k variables can be converted to a k-dimensional multilinear lookup table, which concludes the proof.\n\nLemma 1 can be applied inductively to every layer of cascaded lattices down to the final output F(x). We have shown that cascaded lattices using multilinear interpolation are equivalent to a single multilinear lattice defined on all D features.\n\n4.2 Universal approximation of partial monotone functions\n\nTheorem 4.1 in [12] states that a partial monotone linear embedding followed by min and max pooling can approximate any partial monotone function on the hypercube up to arbitrary precision, given sufficiently high embedding dimension. We show in the next lemma that simplex-interpolated lattices can represent min or max pooling. Thus one can use a DLN constructed with a linear embedding layer followed by two cascaded simplex-interpolated lattice layers to approximate any partial monotone function on the hypercube.\n\nLemma 2. Let θ_min = (0, 0, . . . , 0, 1) ∈ R^{2^n} and θ_max = (1, 0, . . . , 0) ∈ R^{2^n}, and let ψ_simplex be the simplex interpolation weights. Then\n\nmin(x[0], x[1], . . . , x[n]) = ψ_simplex(x)^T θ_min\nmax(x[0], x[1], . . . , x[n]) = ψ_simplex(x)^T θ_max\n\nProof. 
From the definition of simplex interpolation [1], ψ_simplex(x)^T θ = θ[1] x[π[1]] + · · · + θ[2^n] x[π[n]], where π is the sorted order such that x[π[1]] ≥ · · · ≥ x[π[n]], and due to sparsity, θ_min and θ_max select the min and the max.\n\n4.3 Locally linear functions\n\nIf simplex interpolation [1] (aka the Lovász extension) is used, the deep lattice network produces a locally linear function, because each layer is locally linear, and compositions of locally linear functions are locally linear. Note that a D-input lattice interpolated with simplex interpolation has D! linear pieces [1]. If one cascades an ensemble of D lattices into a lattice, then the number of possible locally linear pieces is of the order O((D!)!).\n\n5 Numerical Optimization Details for the DLN\n\nOperators: We implemented 1D calibrators and multilinear interpolation over a lattice as new C++ operators in TensorFlow [15] and express each layer as a computational graph node using these new and existing TensorFlow operators. Our implementation is open sourced and can be found at https://github.com/tensorflow/lattice. We use the Adam optimizer [16] and batched stochastic gradients to update model parameters. After each batched gradient update, we project parameters to satisfy their monotonicity constraints. The linear embedding layer's constraints are element-wise non-negativity constraints, so its projection clips each negative component to zero. This projection can be done in O(# of elements in a monotonic linear embedding matrix). Projection for each calibrator is isotonic regression with chain ordering, which we implement with the pool-adjacent-violators algorithm [17] for each calibrator. This can be done in O(# of calibration keypoints). Projection for each lattice is isotonic regression with a partial ordering that imposes O(S 2^S) linear constraints for each lattice [1]. 
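The calibrator projection, isotonic regression with chain ordering, can be sketched with the pool-adjacent-violators algorithm. Below is a minimal unweighted Python sketch (our own illustration; the released TensorFlow operators implement the actual projection):

```python
def pool_adjacent_violators(b):
    """L2-project b onto the monotone cone {b[0] <= b[1] <= ...}.

    Maintains a stack of blocks (sum, count); merges adjacent blocks
    whenever the later block's mean falls below the earlier one's.
    """
    blocks = []  # each block: [sum, count]
    for v in b:
        blocks.append([v, 1])
        # Merge while the last block's mean violates the ordering
        # (compare means by cross-multiplying to avoid division).
        while (len(blocks) > 1
               and blocks[-1][0] * blocks[-2][1] < blocks[-2][0] * blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

print(pool_adjacent_violators([3.0, 1.0, 2.0]))  # [2.0, 2.0, 2.0]
print(pool_adjacent_violators([1.0, 3.0, 2.0]))  # [1.0, 2.5, 2.5]
```

Each keypoint value is pushed once and merged at most once, so the projection runs in time linear in the number of calibration keypoints, matching the complexity stated above.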
We solved it with consensus optimization and the alternating direction method of multipliers [18] to parallelize the projection computations, with a convergence criterion of ε = 10^{-7}. This can be done in O(S 2^S log(1/ε)).\n\nInitialization: For linear embedding layers, we initialize each component in the linear embedding matrix with IID Gaussian noise N(2, 1). The initial mean of 2 is to bias the initial parameters to be positive so that they are not clipped to zero by the first monotonicity projection. However, because the calibration layer before the linear embedding outputs in [0, 1] and thus is expected to have output E[x_t] = 0.5, initializing the linear embedding with a mean of 2 introduces an initial bias: E[x_{t+1}] = E[W_t x_t] = D_t. To counteract that, we initialize each component of the bias vector, b_t, to -D_t, so that the initial expected output of the linear layer is E[x_{t+1}] = E[W_t x_t + b_t] = 0.\n\nWe initialize each lattice's parameters to be a linear function spanning [0, 1], and add IID Gaussian noise N(0, 1/S^2) to each parameter, where S is the number of inputs to the lattice. We initialize each calibrator to be a linear function that maps [x_min, x_max] to [0, 1] (and do not add any noise).\n\n6 Experiments\n\nWe present results on the same benchmark dataset (Adult) with the same monotonic features as in Canini et al. [3], and for three problems from Google where the monotonicity constraints were specified by product groups. For each experiment, every model considered is trained with monotonicity guarantees on the same set of inputs. 
See Table 2 for a summary of the datasets.\n\nTable 2: Dataset Summary\n\nDataset | Type | # Features (# Monotonic) | # Training | # Validation | # Test\nAdult | Classify | 90 (4) | 26,065 | 6,496 | 16,281\nUser Intent | Classify | 49 (19) | 241,325 | 60,412 | 176,792\nRater Score | Regress | 10 (10) | 1,565,468 | 195,530 | 195,748\nUsefulness | Classify | 9 (9) | 62,220 | 7,764 | 7,919\n\nTable 3: User Intent Case Study Results\n\nModel | Validation Accuracy | Test Accuracy | # Parameters | G × S\nDLN | 74.39% | 72.48% | 27,903 | 30 × 5D\nCrystals | 74.24% | 72.01% | 15,840 | 80 × 7D\nMin-Max network | 73.89% | 72.02% | 31,500 | 90 × 7D\n\nFor classification problems we used logistic loss, and for regression we used squared error. For each problem, we used a validation set to optimize the hyperparameters for each model architecture: the learning rate, the number of training steps, etc. For an ensemble of lattices, we tune the number of lattices, G, and the number of inputs to each lattice, S. All calibrators for all models used a fixed number of 100 keypoints, with [-100, 100] as the input range.\n\nIn all experiments, we use the six-layer DLN architecture: Calibrators → Linear Embedding → Calibrators → Ensemble of Lattices → Calibrators → Linear Embedding, and validate the number of lattices in the ensemble, G, the number of inputs to each lattice, S, the Adam stepsize, and the number of loops.\n\nFor crystals [3] we validated the number of lattices, G, and the number of inputs to each lattice, S, as well as the Adam stepsize and number of loops. 
For the min-max net [12], we validated the number of groups, G, and the dimension of each group, S, as well as the Adam stepsize and number of loops.\n\nFor datasets where all features are monotonic, we also train a deep neural network with non-negative weight matrices and ReLU activation units, with a final fully connected layer with a non-negative weight matrix, which we call a monotonic DNN, akin to the proposals of [7, 8, 9, 10, 11]. We tune the number of hidden layers, G, and the number of activation units in each layer, S.\n\nAll the result tables are sorted by validation accuracy, and contain an additional column for the chosen hyperparameters; 2 × 5D means G = 2 and S = 5.\n\n6.1 User Intent Case Study (Classification)\n\nThis real-world Google problem is to classify user intent. The experiment is set up to test generalization to non-IID test data: the train and validation examples are collected from the U.S., while the test set is collected from 20 other countries, and as a result of this difference between the train/validation and test distributions, there is a notable difference between the validation and the test accuracy. The results in Table 3 show a 0.5% gain in test accuracy for the DLN.\n\n6.2 Adult Benchmark Dataset (Classification)\n\nWe compare accuracy on the benchmark Adult dataset [19], where a model predicts whether a person's income is at least $50,000. Following Canini et al. [3], we require all models to be monotonically increasing in capital-gain, weekly hours of work, education level, and the gender wage gap. We used one-hot encoding for the other categorical features, for 90 features in total. 
We randomly split the usual train set [19] 80-20, trained over the 80%, and validated over the 20%.\n\nTable 4: Adult Results\n\nModel | Validation Accuracy | Test Accuracy | # Parameters | G × S\nDLN | 86.50% | 86.08% | 40,549 | 70 × 5D\nCrystals | 86.02% | 85.87% | 3,360 | 60 × 4D\nMin-Max network | 85.28% | 84.63% | 57,330 | 70 × 9D\n\nResults in Table 4 show the DLN provides better validation and test accuracy than the min-max network or crystals.\n\n6.3 Rater Score Prediction Case Study (Regression)\n\nFor this real-world Google problem, we train a model to predict a rater score for a candidate result, where each rater score is averaged over 1-5 raters, and takes on 5-25 possible real values. All 10 features are required to be monotonic. Results in Table 5 show the DLN has slightly better test MSE than the two-layer crystals model, and much better MSE than the other monotonic networks.\n\nTable 5: Rater Score Prediction (Monotonic Features Only) Results\n\nModel | Validation MSE | Test MSE | # Parameters | G × S\nDLN | 1.2078 | 1.2096 | 81,601 | 50 × 9D\nCrystals | 1.2101 | 1.2109 | 1,980 | 10 × 7D\nMin-Max network | 1.3474 | 1.3447 | 5,500 | 100 × 5D\nMonotonic DNN | 1.3920 | 1.3939 | 2,341 | 20 × 100D\n\n6.4 Usefulness Case Study (Classification)\n\nFor this real-world Google problem, we train a model to predict whether a candidate result adds useful information given the presence of another result. 
All 9 features are required to be monotonic. Table 6 shows the DLN has slightly better validation and test accuracy than crystals, and both are notably better than the min-max network or the positive-weight DNN.\n\nTable 6: Usefulness Results\n\nModel | Validation Accuracy | Test Accuracy | # Parameters | G × S\nDLN | 66.08% | 65.26% | 81,051 | 50 × 9D\nCrystals | 65.45% | 65.13% | 9,920 | 80 × 6D\nMin-Max network | 64.62% | 63.65% | 4,200 | 70 × 6D\nMonotonic DNN | 64.27% | 62.88% | 2,012 | 1 × 1000D\n\n7 Conclusions\n\nIn this paper, we proposed combining three types of layers, (1) calibrators, (2) linear embeddings, and (3) multidimensional lattices, to produce a new class of models we call deep lattice networks that combines the flexibility of deep networks with the regularization, interpretability, and debuggability advantages that come with being able to impose monotonicity constraints on some inputs.\n\nReferences\n\n[1] M. R. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research, 17(109):1-47, 2016.\n\n[2] E. K. Garcia and M. R. Gupta. Lattice regression. In Advances in Neural Information Processing Systems (NIPS), 2009.\n\n[3] K. Canini, A. Cotter, M. M. Fard, M. R. Gupta, and J. Pfeifer. Fast and flexible monotonic functions with ensembles of lattices. Advances in Neural Information Processing Systems (NIPS), 2016.\n\n[4] A. Howard and T. Jebara. Learning monotonic transformations for classification. Advances in Neural Information Processing Systems (NIPS), 2007.\n\n[5] L. Lovász. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pages 235-257. Springer, 1983.\n\n[6] P. Groeneboom and G. Jongbloed. Nonparametric Estimation under Shape Constraints. Cambridge Press, New York, USA, 2014.\n\n[7] N. P. Archer and S. Wang. 
Application of the back propagation neural network algorithm with monotonicity constraints for two-group classification problems. Decision Sciences, 24(1):60-75, 1993.\n\n[8] S. Wang. A neural network method of density estimation for univariate unimodal data. Neural Computing & Applications, 2(3):160-167, 1994.\n\n[9] H. Kay and L. H. Ungar. Estimating monotonic functions and their bounds. AIChE Journal, 46(12):2426-2434, 2000.\n\n[10] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating functional knowledge in neural networks. Journal of Machine Learning Research, 2009.\n\n[11] A. Minin, M. Velikova, B. Lang, and H. Daniels. Comparison of universal approximators incorporating partial monotonicity by structure. Neural Networks, 23(4):471-475, 2010.\n\n[12] H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Trans. Neural Networks, 21(6):906-917, 2010.\n\n[13] J. Sill. Monotonic networks. Advances in Neural Information Processing Systems (NIPS), 1998.\n\n[14] W. W. Armstrong and M. M. Thomas. Adaptive logic networks. Handbook of Neural Computation, Section C1.8, IOP Publishing and Oxford U. Press, 1996.\n\n[15] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.\n\n[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[17] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26(4):641-647, 1955.\n\n[18] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.\n\n[19] C. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.\n", "award": [], "sourceid": 1713, "authors": [{"given_name": "Seungil", "family_name": "You", "institution": "Google"}, {"given_name": "David", "family_name": "Ding", "institution": "Google"}, {"given_name": "Kevin", "family_name": "Canini", "institution": "Google"}, {"given_name": "Jan", "family_name": "Pfeifer", "institution": "Google"}, {"given_name": "Maya", "family_name": "Gupta", "institution": "Google"}]}