{"title": "Neural Interaction Transparency (NIT): Disentangling Learned Interactions for Improved Interpretability", "book": "Advances in Neural Information Processing Systems", "page_first": 5804, "page_last": 5813, "abstract": "Neural networks are known to model statistical interactions, but they entangle the interactions at intermediate hidden layers for shared representation learning. We propose a framework, Neural Interaction Transparency (NIT), that disentangles the shared learning across different interactions to obtain their intrinsic lower-order and interpretable structure. This is done through a novel regularizer that directly penalizes interaction order. We show that disentangling interactions reduces a feedforward neural network to a generalized additive model with interactions, which can lead to transparent models that perform comparably to the state-of-the-art models. NIT is also flexible and efficient; it can learn generalized additive models with maximum $K$-order interactions by training only $O(1)$ models.", "full_text": "Neural Interaction Transparency (NIT):\nDisentangling Learned Interactions for\n\nImproved Interpretability\n\nMichael Tsang1, Hanpeng Liu1, Sanjay Purushotham1, Pavankumar Murali2, and Yan Liu1\n\n1University of Southern California\n\n2IBM T.J. Watson Research Center\n\n{tsangm,hanpengl,spurusho,yanliu.cs}@usc.edu, pavanm@us.ibm.com\n\nAbstract\n\nNeural networks are known to model statistical interactions, but they entangle the\ninteractions at intermediate hidden layers for shared representation learning. We\npropose a framework, Neural Interaction Transparency (NIT), that disentangles the\nshared learning across different interactions to obtain their intrinsic lower-order\nand interpretable structure. This is done through a novel regularizer that directly\npenalizes interaction order. 
We show that disentangling interactions reduces a\nfeedforward neural network to a generalized additive model with interactions,\nwhich can lead to transparent models that perform comparably to the state-of-the-\nart models. NIT is also \ufb02exible and ef\ufb01cient; it can learn generalized additive\nmodels with maximum K-order interactions by training only O(1) models.\n\n1\n\nIntroduction\n\nFeedforward neural networks are typically viewed as powerful predictive models possessing the\nability to universally approximate any function [13]. Because neural networks are increasingly used\nin critical domains, including healthcare and \ufb01nance [2, 22, 9, 5, 30], there is a strong desire to\nunderstand how they make predictions. One of the key issues preventing neural networks from being\nvisualizable and understandable is that they assume the variable relationships in data are extremely\nhigh-dimensional and complex [23]. Speci\ufb01cally, each hidden neuron takes as input all nodes from\nthe previous layer and creates a high-order interaction between these nodes.\n\nA fundamental challenge facing the interpretability of neural networks is the entangling of feature\ninteractions within the networks. An interaction is entangled in a neural network if any hidden unit\nlearns an interaction out of separate true feature interactions. For example, suppose that a feedforward\nneural network is trained on data consisting of two multiplicative pairwise interactions, x1x2 and\nx3x4. The neural network entangles the interactions if any hidden unit learns an interaction between\nall four variables {1, 2, 3, 4}. Figure 1 shows how this interaction entangling is generally true.\nAlthough a relatively small percentage of hidden units in the \ufb01rst hidden layer entangle the pairwise\ninteractions, as soon as the second hidden layer, nearly all hidden units entangle the interactions when\nat least one of them is present. 
Previous works have studied how to disentangle factors of variation\nin neural networks [4, 18, 14], but none of them have addressed the entangling of interactions. It\nremains an open problem of how to learn an interpretable neural network by disentangling feature\ninteractions.\n\nIn this work, we propose the Neural Interaction Transparency (NIT) framework to learn interpretable\nfeedforward neural networks. NIT learns to separate feature interactions within the neural network by\nnovel use of regularization. The framework corresponds to reducing a feedforward neural network into\na generalized additive model with interactions [20, 2], which can lead to an accurate and transparent\nmodel [30, 29]. By disentangling interactions, we are able to learn this exact model faster, and it is\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: A demonstration of interaction entangling in a feedforward neural network. When trained\non a simple dataset generated by x1x2 + x3x4 under common forms of regularization, the neural\nnetwork tends to keep the interactions separated in the \ufb01rst hidden layer but entangles them in the\nsecond layer. In contrast, NIT (black and cyan) is able to fully disentangle the interactions. The\nmeaning of entangling is detailed in \u00a73.\n\nvisualizable. Our framework is \ufb02exible to disentangle interactions up to a user-speci\ufb01ed maximum\norder. 
Our contributions are as follows: 1) we demonstrate that feedforward neural networks entangle\ninteractions under sparsity regularization, 2) we develop a novel architecture and regularization\nframework, NIT, for disentangling the interactions to be order K at maximum, and 3) we construct a\ngeneralized additive model with interactions by training only O(1) models.\n\n2 Related Works and Notations\n\nIn this section, we brie\ufb02y discuss related literature on interpretability and disentanglement in neural\nnetworks, and generalized additive models.\n\nInterpretability: Various methods [7, 11, 23, 27, 3, 31, 29] exist to interpret feedforward neural\nnetworks; however, none so far has attempted to learn or uncover a generalized additive model with\ninteractions as a subset of a feedforward network with fully-connected layers. The closest work to\nour proposed framework is recent research on detecting statistical interactions from the weights of\na feedforward network [31]. Because this research only extracted interactions created at the \ufb01rst\nhidden layer, it is unknown how interactions propagate through the intermediate hidden layers of the\nnetwork. Other approaches to interpreting feedforward neural networks include extracting per-sample\nfeature importance via inputs gradients [11] and global feature importance via network weights [7].\nInterpretations can also be obtained by \ufb01tting simpler interpretable models [27, 3, 29] to neural\nnetwork predictions, models of which include decision trees [3] and GA2Ms [29]. Lastly, feedforward\nneural networks have been used to learn generalized additive models before [25, 31], however their\nconstruction either did not learn interactions [25] or were computationally expensive [31].\n\nDisentanglement: The topic of disentanglement in neural networks comes in varied forms, but\nit is often studied to extract an interpretable representation from neural networks. 
For example, research on disentangling factors of variation has focused on modifying the training of deep networks to identify interpretable latent codes [4, 18, 14, 12] and developing ways to interpret intrinsically disentangled representations after training [26, 1]. While our work studies a different objective to achieve disentanglement, it explores the same two approaches to reach our objective.

Generalized Additive Models (GAMs): GAMs are typically viewed as powerful interpretable models in machine learning and statistics [10]. The standard form of this model only constructs univariate functions of features, namely g(E[y]) = Σ_i f_i(x_i), where g is a link function and f_i can be any arbitrary nonlinear function. This model can easily be visualized because each f_i can be independently plotted by varying each function's input and observing the response on the output. Previous research [20] identified that GAMs in their standard form are limited in their expressive power by not learning interactions between features inherent in data. As a result, a class of models, GA2M, was introduced in [20] to include pairwise interaction complexity into the GAM, in the form of g(E[y]) = Σ_i f_i(x_i) + Σ_{i,j} f_ij(x_i, x_j). The addition of pairwise interactions was of special interest because they are still visualizable, but now in the form of heatmaps [2].

Unlike previous works which investigated disentanglement in neural networks for interpretability, our work studies disentangling interactions in neural networks to uncover GAMs.

2.1 Notations

Vectors are represented by boldface lowercase letters, such as x, w; matrices are represented by boldface capital letters, such as W. 
The i-th entry of a vector w is denoted by w_i, and element (i, j) of a matrix W is denoted by W_ij.

Let p be the number of features and N be the number of samples in a dataset. An interaction, I, is a subset of all input features: I ⊆ {1, 2, . . . , p}, where the interaction order |I| is always greater than or equal to 2. A univariate variable, u, is a single feature that does not interact with other variables. For a vector x ∈ R^p, let x_I ∈ R^|I| be the vector restricted to the dimensions specified by I, and similarly let x_u be the scalar restricted to the dimension specified by u.

We let a feedforward neural network and the model learned by our proposed method (denoted by NIT) have the same notations. Consider a feedforward neural network f(·) with L hidden layers and the parameters: L + 1 weight matrices W^(ℓ) ∈ R^{p_ℓ × p_(ℓ−1)} and L + 1 bias vectors b^(ℓ) ∈ R^{p_ℓ}, ℓ = 1, 2, . . . , L + 1. Let p_ℓ be the number of hidden units in the ℓ-th layer. We treat the input features as the 0-th layer with p_0 = p the number of input features, and we treat the output as the (L + 1)-th layer with p_(L+1) = 1. The activation function used, φ(·), is the ReLU nonlinearity. Then the hidden units h^(ℓ) of the neural network and the output y with input x ∈ R^p can be expressed as:

h^(0) = x,   h^(ℓ) = φ(W^(ℓ) h^(ℓ−1) + b^(ℓ)), ∀ℓ = 1, 2, . . . , L,   y = w^(L+1) h^(L) + b^(L+1).

Please note that an "interaction" in this paper is a statistical interaction that describes a non-additive influence [28] between features on a single outcome or prediction variable.

3 Interaction Entangling in Neural Networks

We define a neural network entangling interactions in the following way:

Definition 1 (Entangled Interactions). 
Let S be the set of all true feature interactions {I_i}_{i=1}^|S| in a dataset, and let h be any hidden unit in a feedforward neural network. Let h capture a feature interaction Î if and only if there exist nonzero weighted paths between h and each interacting feature and between h and the output y. If h captures Î such that for two different interactions I_i, I_j ∈ S we have I_i ⊊ Î and I_j ⊊ Î, then the hidden unit entangles interactions, and correspondingly the neural network entangles interactions.

For example, suppose we have a dataset D that contains N samples and 4 features. Each label is generated by the function y = x_1 x_2 + x_3 x_4, where each feature x_i, i = 1, . . . , 4 is i.i.d. with uniform distribution between −1 and 1. As a result, D contains two pairwise interactions: {1, 2}, {3, 4}. Consider a neural network f(·), which is trained on the dataset D to predict the label. If any hidden unit learns the interaction {1, 2, 3, 4}, then it has entangled interactions.

We desire a feedforward neural network that does not entangle interactions at any hidden layer so that each interaction is separated in additive form. Specifically, we aim to learn a function of the form:

f̃(x) = Σ_{i=1}^R r_i(x_{I_i}) + Σ_{i=1}^S s_i(x_{u_i}),   (1)

where {I_i}_{i=1}^R is a set of interactions, {u_i}_{i=1}^S is a set of univariate variables, and r_i(·) and s_i(·) can be any arbitrary functions of their respective inputs. For example, we would like our previous f(·) to be decomposed into an addition of two functions, e.g. r_1(x_{1,2}) + r_2(x_{3,4}), where both r_1, r_2 perform multiplication. A model that learns the additive function in Eq. 1 is a generalized additive model with interactions.

A recent work [31] has shown that the weights to the first hidden layer are exceptional at detecting interactions in data using common weight regularization techniques like L1 or L2. 
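To make the running example concrete, the dataset D above can be generated in a few lines. This is a minimal sketch under the paper's stated data-generation process; `make_dataset` is our own illustrative helper, not code from the paper:

```python
import numpy as np

# Sketch of the synthetic dataset D: N samples, 4 features, each i.i.d.
# uniform on [-1, 1], label y = x1*x2 + x3*x4, so the data contains
# exactly two pairwise interactions, {1, 2} and {3, 4}.
def make_dataset(n_samples, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_samples, 4))
    y = X[:, 0] * X[:, 1] + X[:, 2] * X[:, 3]
    return X, y

X, y = make_dataset(30000)  # N = 3e4, as in the experiments of Section 3
```
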
Even when assuming that each hidden unit in the first layer modeled one interaction, interaction detection was still very accurate. These results lead to the question of whether neural networks automatically separate out interactions at all hidden layers, as in Figure 2b, when common regularization techniques are applied.

Figure 2: An illustrative comparison between two simple feedforward networks trained on data with interactions {1, 2} and {3, 4}. (a) A standard feedforward neural network, and (b) a desirable network architecture that separates the two interactions. All hidden neurons have ReLU activation, and y is a linear neuron which can precede a sigmoid link function if classification is desired.

To test this hypothesis, we train 10 trials of ReLU-based feedforward networks of size 4-100-100-100-100-1 on dataset D with N = 3e4 at equal train/validation/test splits and different regularizations, which were tuned on the validation set. We then calculate the percentage of hidden units entangling interactions in the first and second hidden layers at different lower thresholds on the magnitudes of all weights (i.e. weight magnitudes below the threshold are zeroed) (see Figure 1). We note that when calculating percentages, if a hidden unit does not learn any of the true interactions, even a superset of one, then that hidden unit is ignored. Consistent with [31], the first hidden layer with L1 or L2 regularization is capable of keeping the interactions separated, but at the second hidden layer, nearly all hidden units entangle the interactions at every threshold when at least one of the interactions is modeled. 
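The measurement above can be operationalized roughly as follows. This is a hypothetical sketch: `captured_features` and `entangles` are our own names, and Definition 1's nonzero-weighted-path condition is approximated by boolean reachability through thresholded weight matrices:

```python
import numpy as np

# A hidden unit "captures" the input features that reach it through
# weights surviving the magnitude threshold; per Definition 1, it
# entangles interactions if its captured set strictly contains at least
# two different true interactions.
def captured_features(weight_mats, layer, unit, threshold):
    masks = [np.abs(W) > threshold for W in weight_mats[: layer + 1]]
    reach = masks[0].astype(int)
    for M in masks[1:]:
        # compose boolean reachability across layers
        reach = (M.astype(int) @ reach > 0).astype(int)
    return {j for j in range(reach.shape[1]) if reach[unit, j]}

def entangles(captured, true_interactions):
    # count true interactions that are proper subsets of the captured set
    return sum(1 for I in true_interactions if set(I) < captured) >= 2
```
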
Therefore, the regularization does very little to prevent the pairwise interactions from entangling within the neural network.

Our understanding that feedforward networks entangle interactions even in the simple setting with two multiplicative interactions motivates the need to disentangle them.

4 Disentangling Interactions for Interpretability

In this section, we explain our architecture and regularization framework, NIT, to disentangle interactions in a feedforward neural network.

4.1 Architecture Choice

In order to disentangle interactions, we first propose a feedforward network modification that has a dense input weight matrix followed by multiple network blocks, as depicted in Figure 3. Our choice of a dense input weight matrix follows the recent finding that a sparsely regularized input weight matrix tends to automatically separate out different true feature interactions to different first layer hidden units [31]. Separate block networks at upper layers are used to force the representation learning of separate interactions to be disentangled from each other. The remaining challenge is to ensure that each block only learns one interaction or univariate variable to generate the desired GAM structure (Eq. 1)¹. Note that in this model, the number of blocks, B, must be pre-specified. Two approaches to selecting B are either choosing it to be large and letting sparse regularization cancel unused blocks, or setting B to be small in case a small number of blocks is desired for human interpretation.

In the experiments section (§5.5), we additionally discuss our results on an approach that does not require any pre-specification of network blocks. 
Instead, this approach attempts to separate interactions through every layer during training rather than rely on the forced separation of blocks.

¹It is possible that multiple blocks learn the same interaction or univariate variable, and some blocks can learn a subset of what other blocks learn. In such cases, the duplicate or redundant blocks can be combined. We leave the prevention of this type of learning for future work.

Figure 3: A version of our NIT model architecture. Here, NIT consists of B multi-layer network blocks above a common input dense layer. Appropriate regularization on the dense layer forces each block to model a single interaction or univariate variable. This model can equivalently be seen as a standard feedforward neural network with block diagonal weight matrices at intermediate layers.

4.2 Disentangling Regularization

As mentioned in the previous section (§4.1), each network block must learn one interaction or univariate variable. Formally, each input hidden unit h ∈ {h_i}_{i=1}^{p_1/B} to block b ∈ {b_i}_{i=1}^B must learn the same interaction I ∈ {I_i}_{i=1}^R or univariate variable u ∈ {u_i}_{i=1}^S, where R + S ≤ B. We propose to learn such a model by way of regularization that explicitly defines maximum allowable interaction orders. Fixing the maximum order to be 2 has been the standard in the construction of GAMs with pairwise interactions [20].

The existence of interactions or univariate variables being modeled by first layer hidden units is determined by the nonzero weights entering those units [31]. Therefore, we would like to penalize the number of nonzero elements in rows of the input weight matrix, as a group belonging to a block. 
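As a concrete sketch of the block architecture of §4.1 (Figure 3), a forward pass with a shared dense input layer and B independent ReLU blocks whose scalar outputs are summed might look like the following. This is our own minimal illustration, not the authors' implementation; biases and the output link function are omitted:

```python
import numpy as np

# Minimal sketch of the NIT architecture: one dense input layer shared by
# all blocks, then B independent ReLU blocks, each contributing one
# additive term to the output (block-diagonal intermediate weights).
def relu(z):
    return np.maximum(z, 0.0)

def nit_forward(x, W_in, block_params):
    # W_in: (B * units_per_block, p) dense input weight matrix.
    # block_params: list of B lists of weight matrices; the last entry of
    # each list is the block's linear output vector.
    h = relu(W_in @ x)                        # shared first hidden layer
    units = h.shape[0] // len(block_params)   # units per block
    y = 0.0
    for b, layers in enumerate(block_params):
        hb = h[b * units:(b + 1) * units]     # block b sees only its slice
        for W in layers[:-1]:
            hb = relu(W @ hb)
        y += float(layers[-1] @ hb)           # each block adds one term
    return y
```
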
The number of nonzero elements can be obtained by using L0 regularization; however, L0 is non-differentiable and cannot be used in gradient-based optimization [21]. Recently, Louizos et al. [21] developed a differentiable surrogate to the L0 norm by smoothing the expected L0 and using approximately binary gates g ∈ R^n to determine which parameters θ ∈ R^n to set to zero. The parameters θ of a hypothesis ĥ are then re-parameterized as separate parameters θ̃ and φ, such that for a dataset of N samples {(x_1, y_1), . . . , (x_N, y_N)}, Empirical Risk Minimization becomes:

R(θ̃, φ) = (1/N) Σ_{i=1}^N L(ĥ(x_i; θ̃ ⊙ g(φ)), y_i) + λ Σ_{j=1}^n g_j(φ_j),   (2)

which follows the ideal reparameterization of θ into θ̃ and g:

θ_j = θ̃_j g_j,   g_j ∈ {0, 1},   θ̃_j ≠ 0,   ‖θ‖_0 = Σ_{j=1}^n g_j.   (3)

We note that Eq. 2 is not in its final form for clarity; it in fact uses distributions for the gates to enable differentiability and exact zeros in parameters, and we refer interested readers to [21]. We propose a disentangled group regularizer, denoted L_K, to disentangle feature interactions in the NIT framework. L_K is designed to be a group version of the smoothed L0 regularization. Let G, Φ ∈ R^{B×p} be matrix versions of the vectors g and φ from Eq. 2. Let T : R^{B×p} → R^{p_1×p} assign the same gate to all first layer hidden units in a block corresponding to a single feature, for all such groups of hidden units in every block. Just as θ̃_j ≠ 0 in Eq. 3, W̃^(1)_ij ≠ 0, ∀i = 1, . . . , p_1 and ∀j = 1, . . . , p. 
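For intuition, the approximately binary gates of [21] are commonly implemented by sampling a stretched "hard concrete" variable. The sketch below is illustrative rather than the authors' code, and the constants beta, gamma, zeta are the defaults reported in that paper:

```python
import numpy as np

# Hedged sketch of a hard-concrete gate (Louizos et al. [21]): a smooth,
# reparameterized sample that is stretched and clipped so the gate can
# reach exact 0 or exact 1 while staying differentiable in between.
def hard_concrete_gate(log_alpha, rng, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(log_alpha))
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / beta))
    s_bar = s * (zeta - gamma) + gamma   # stretch beyond [0, 1]
    return np.clip(s_bar, 0.0, 1.0)     # clip back, creating exact 0s/1s
```
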
Then, the cost function for our NIT model f(·) has the form:

R_NIT = (1/N) Σ_{i=1}^N L(f(x_i; W̃^(1) ⊙ T(G(Φ)), {W^(ℓ)}_{ℓ=2}^{L+1}, {b^(ℓ)}_{ℓ=1}^{L+1}), y_i) + L_K,   (4)

where L_K is our proposed disentangling regularizer. Since G_ij is ≈ 1 when a feature is active in a block and ≈ 0 otherwise, the estimated interaction order of a block is defined as k̂_i = Σ_{j=1}^p G_ij(Φ_ij), ∀i = 1, 2, . . . , B. Let B̃ = Σ_{i=1}^B 1[k̂_i ≠ 0]. Then, we can conveniently learn generalized additive models with desirable properties by including two terms in our regularizer:

L_K = max{(max_i k̂_i) − K, 0} + λ (1/B̃) Σ_{i=1}^B k̂_i,   (5)

where the first (rectified) term limits the maximum interaction order to be K, and the second term encourages smaller interaction orders and block sparsity.

Table 1: A comparison of the number of models needed to construct various forms of GAMs. GAKM all interactions refers to constructing a GAM for every interaction of order ≤ K respectively. MLPcutoff is an additive model of Multilayer Perceptrons, and η is the top number of interactions based on a learned cutoff.

Framework | GAM original [10] | GAKM all interactions | GA2M [20] | MLPcutoff [31] | NIT (proposed)
# models  | O(p)              | O(p^K)                | O(p^2)    | O(η)           | O(1)

The first term is responsible for penalizing the maximum interaction order during training to be a pre-specified positive integer K. A threshold at K is enforced by a rectifier [8], which is used for its differentiability and sharp on/off switching behavior at K. 
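Given a gate matrix, the two terms of Eq. 5 are cheap to compute. A minimal sketch follows (our own helper name, ignoring the distributional treatment of the gates during training):

```python
import numpy as np

# Sketch of the disentangling regularizer L_K of Eq. 5, given the gate
# matrix G(Phi) of shape (B, p) with entries in [0, 1].  k_hat[i] is the
# estimated interaction order of block i; the first term caps the maximum
# order at K via a rectifier, the second averages the nonzero orders
# (encouraging smaller orders and unused blocks).  Illustrative only.
def disentangling_penalty(G, K, lam):
    k_hat = G.sum(axis=1)                 # estimated order per block
    cap = max(k_hat.max() - K, 0.0)       # rectified max-order penalty
    active = k_hat[k_hat > 0]             # B_tilde counts nonzero blocks
    mean_order = active.mean() if active.size else 0.0
    return cap + lam * mean_order
```
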
The second term both penalizes the\naverage non-zero interaction order over all blocks and sparsi\ufb01es unused blocks.\n\n4.3 Advantages of our NIT Framework\n\nConstructing generalized additive models by disentangling interactions has important advantages,\nthe main of which is that our NIT framework can learn the GAM in O(1) models, whereas previous\nmethods either needed to learn a model for each interaction, or in the case of traditional univariate\nGAMs [10], learn a model for each feature (Table 1). Our approach can construct the GAM quickly\nbecause it leverages gradient based optimization to determine interaction separations.\n\nIn addition to ef\ufb01ciency, our approach is the \ufb01rst to investigate setting hard maximum thresholds on\ninteraction order, for any order K. This is a straightforward result of our regularizer formulation.\n\nPrevious methods [20, 2] have focused on advocating tree-based GAMs to interpret and visualize\nwhat the model learns, and there exist few works which have explored neural network based GAMs\nwith interactions for interpretability. While the tree-based GAMs can provide interpretability, their\nvisualization can appear jagged since decision trees divide their feature space into axis-parallel rectan-\ngles which may not be user-friendly [20]. This kind of jagged visualization is not a problem for neural\nnetwork-based GAMs, which as a result can produce smoother and more intuitive visualizations.\n\n5 Experiments\n\n5.1 Experimental Setup\n\nWe validate the ef\ufb01cacy of our method \ufb01rst on synthetic data and then on four real-world datasets.\n\nOur experiment on synthetic data is the baseline disentangling experiment discussed in \u00a73. 
We use real-world datasets to evaluate the predictive performance of NIT (under restriction of maximum interaction order K) as compared to standard and relevant machine learning models: Linear/Logistic Regression (LR), GAM [19], GA2M [20], Random Forests (RF), and the Multilayer Perceptron (MLP). We consider standard LR and GAM as the models that do not learn interactions, GA2M learns up to pairwise interactions, and RF and MLP are full-complexity models [20] that can learn all interactions. After the performance evaluation, we visualize and examine interactions learned by our NIT from a medical dataset. The real-world datasets (Table 2) include two regression datasets previously studied in statistical interaction research: Cal Housing [24] and Bike Sharing [6], and two binary classification datasets, MIMIC-III [15] and CIFAR-10 binary. CIFAR-10 binary is a binary classification dataset (derived from CIFAR-10 [17]) with two randomly selected classes, which are "cat" and "deer" in our experiments. We study binary classification as opposed to multi-class classification to minimize learned interaction orders and accord with previous research on GAMs [20]. Root-mean squared error (RMSE) and Area under ROC (AUC) are used as the evaluation metrics for the regression and classification tasks.

Table 2: Real-world datasets

Dataset         | N     | p    | %Pos
Cal Housing     | 20640 | 8    | -
Bike Sharing    | 17379 | 15   | -
MIMIC-III       | 20922 | 40   | 14.02%
CIFAR-10 binary | 12000 | 3072 | 50.0%

In all of our NIT models, we set B = 20 and use an equal number of hidden units per network block for any given hidden layer. The hyperparameter λ in our disentangling regularizer (Eq. 5) was found

Table 3: Predictive performance of NIT. RMSE is calculated from standard scaled outcome variables. *GA2M took several days to train and did not converge. 
Lower RMSE and higher AUC mean better model performance.

Model | Cal Housing (K, RMSE)  | Bike Sharing (K, RMSE)  | MIMIC-III (K, AUC)   | CIFAR-10 binary (K, AUC)
LR    | 0.60 ± 0.016           | 0.78 ± 0.021            | 0.70 ± 0.013         | 0.676 ± 0.0072
GAM   | 0.506 ± 0.0078         | 0.55 ± 0.016            | 0.75 ± 0.015         | 0.829 ± 0.0014
GA2M  | 0.435 ± 0.0077         | 0.307 ± 0.0080          | 0.73 ± 0.012         | *
NIT   | K=2: 0.448 ± 0.0080    | K=2: 0.31 ± 0.013       | K=2: 0.76 ± 0.011    | K=10: 0.849 ± 0.0049
      | K=3: 0.437 ± 0.0077    | K=3: 0.26 ± 0.015       | K=4: 0.76 ± 0.013    | K=15: 0.858 ± 0.0020
      | K=4: 0.43 ± 0.013      | K=4: 0.240 ± 0.0097     | K=6: 0.77 ± 0.011    | K=20: 0.860 ± 0.0034
RF    | 0.435 ± 0.0095         | 0.243 ± 0.0053          | 0.685 ± 0.0087       | 0.793 ± 0.0034
MLP   | 0.445 ± 0.0081         | 0.22 ± 0.012            | 0.771 ± 0.0096       | 0.860 ± 0.0046

by running a grid search for validation performance on each fold of a 5-fold cross-validation². In our experiments, we don't assume K, so we report the performances of NIT when varying K. The learning rate was fixed at 5e−2 while the disentangling regularization was applied. For the hyperparameters of baselines and those not specific to NIT, tuning was done on the validation set. For all experiments with neural nets, we use the ADAM optimizer [16] and early stopping on validation sets.

5.2 Training the NIT Framework

Model training in NIT was conducted in two phases. The first was a disentangling phase where each block learned one interaction or univariate variable. The second phase kept L_K = 0 and G fixed, so that G acted as a mask that deactivated multiple features in each block. The second phase of training starts when the maximum interaction order across all blocks³ is ≤ K and the maximum interaction order of the disentangling phase stabilizes. 
We also reinitialized parameters between\ntraining phases in case optimization was stuck at local minima.\n\n5.3 Disentangling Experiment\n\nWe revisit the same function x1x2 + x3x4 that the MLP failed at disentangling (\u00a73) and evaluate\nNIT instead. We train 10 trials of NIT with a 4-100-100-100-100-1 architecture like before (\u00a73)\nand a grid search over K. In Figure 1 we show that NIT disentangles the x1x2 and x3x4 pairwise\ninteractions at all possible lower weight thresholds while maintaining a performance (RMSE =\n1.3e \u2212 3) similar to that of MLP. Note that the architecture choice of NIT (Figure 3) automatically\ndisentangles interactions in the entire model when the \ufb01rst two hidden layers are disentangled.\n\n5.4 Real-World Dataset Experiments\n\nOn real-world datasets (Table 2), we evaluate the predictive performance of NIT at different levels of\nK, as shown in Table 3. For the Cal Housing, Bike Sharing, and MIMIC-III datasets, we choose K\nto be 2 \ufb01rst and increase it until NIT\u2019s predictive performance is similar to that of RF or MLP. For\nCIFAR-10 binary, we set K = 10, 15, 20 to demonstrate the capability of NIT to learn high-order\ninteractions. The exact statistics of learned interaction orders for all datasets are shown in appendix\nA in the supplementary materials. For all the datasets, the predictive performance of NIT is either\ncomparable to MLP at low K, or comparable to GA2M at K = 2 and RF/MLP at higher values of\nK, as expected.\n\nIn Figure 4, we provide all visualizations of what NIT learns at K = 2 on one fold of the MIMIC-III\ndataset. MIMIC-III is currently the largest public health records dataset [15], and our prediction\ntask is classifying whether a patient will be re-admitted into an intensive care unit within 30 days.\nSince K = 2, all learned interactions are plotted as heatmaps as shown, and the remaining univariate\nvariables are shown in the left six plots. 
We notice interesting patterns, for example when a patient's minimum temperature rises to ≈ 40°C, the chance for readmission drops sharply. Another interesting pattern is in an interaction plot showing that as sofa score (which estimates mortality risk) increases, the chance for readmission decreases. We checked that these potentially unintuitive patterns are indeed consistent with those in the actual dataset by examining the frequency of readmission labels relative to temperature or sofa score. This insight may warrant further investigation by medical experts.

²For all real-world datasets except CIFAR-10 binary, 5-fold cross-validation was used, where model training was done on 3 folds, validation on the 4th fold and testing on the 5th. For the CIFAR dataset, the standard test set was only used for testing, and an inner 5-fold cross-validation was done on an 80%-20% train-validation split.

³With this criterion, K ≠ p, but the criterion can be changed to allow K = p.

Figure 4: Visualizations that provide a global [27] and transparent [30] interpretation of NIT trained on the MIMIC-III dataset at K = 2. Outcome scores are interpreted as contribution to 30-day hospital readmission in the same way described by [2]. The output bias of NIT is 0.21.

5.5 Disentangling Interactions Through All Layers

In addition to disentangling at the first weight matrix, we discuss results on modifying NIT to disentangle interactions through all layers' weight matrices. By doing this, we no longer require B network blocks nor group L0. Instead, L0 is applied to each individual weight as done normally in [21]. We now define layer-wise gate matrices G^(1), G^(2), . . . , G^(L) of gates for each weight in the corresponding matrices W^(1), W^(2), . . . , W^(L). The estimated interaction order k̂_i from Eq. 5 is now defined for each neuron i in the last hidden layer as k̂_i = Σ_{j=1}^p [σ(G^(L) G^(L−1) · · · 
G(1))]ij , where\nnormalized matrix multiplications between G(\u2113)\u2019s are taken. Here, \u03c3 is a sigmoid-type function,\n\u03c3(G\u2032) = G\u2032\nc+|G\u2032| , which approximates a function that sends all elements of G\u2032 greater than or equal\nto 1 to be 1, otherwise 0 (c is a hyperparameter satisfying 0 < c \u226a 1). Theoretical justi\ufb01cation for\nthe formulation of \u02c6ki is provided in Appendix E.\nDisentangling interactions through all layers can perform well in regression tasks. When we let this\napproach discover the max interaction order by setting K = 0 for LK in Eq. 5 and c = 1e \u2212 2, NIT\nis able to reach 0.43 RMSE at max order 3 for Cal Housing, and 0.26 RMSE at max order 6 for Bike\nSharing without re-initializing models. Now without network blocks, NIT architectures are smaller\nthan before (Appendix D), i.e. 8-200-200-1 for Cal Housing and 15-300-200-100-1 for Bike Sharing.\n\n5.6 Limitations\n\nAlthough NIT can learn a GAM with interactions in O(1) models, the disentangling phase in our\ntraining optimization can take longer with increasing p or B. In addition, if K is not pre-speci\ufb01ed, a\nsearch for an optimal K for peak predictive performance can be a slow process when testing each K.\nFinally, since optimization is non-convex, there is no guarantee that correct interactions are learned.\n\n6 Conclusion\n\nWe have investigated an interpretation of a feedforward neural network based on its entangling of\ninteractions, and we proposed our framework, NIT, for disentangling the interactions to learn a\nmore interpretable neural network. To disentangle interactions, we developed a way of maintaining\nseparations in the network and applying a direct penalty on interaction orders. NIT corresponds to\nreducing a feedforward neural network to a generalized additive model (GAM) with interactions,\nallowing it to learn the GAM in O(1) models. 
In experiments, we have demonstrated the effectiveness of NIT at obtaining high predictive performance at different K and the value of NIT for visualization. For future work, we would like to study ways of making NIT perform at state-of-the-art levels for interaction detection in addition to obtaining high predictive performance under K-order constraints.

[Figure 4: plots of chance of readmission vs. diasbp_min, platelets_min, resprate_max, platelets_max, sofa, urea_n_max, albumin_min, sapsii, temp_min, magnesium_max, hr_max, and sysbp_min.]

Acknowledgments

We thank Umang Gupta and anonymous reviewers for their generous feedback. This work was supported by National Science Foundation awards IIS-1254206 and IIS-1539608 and a Samsung GRO grant.

References

[1] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 3319–3327. IEEE, 2017.

[2] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730. ACM, 2015.

[3] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for ICU outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, page 371.
American Medical Informatics Association, 2016.

[4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[5] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017.

[6] Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3):113–127, 2014.

[7] G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.

[8] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.

[9] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". arXiv preprint arXiv:1606.08813, 2016.

[10] Trevor J Hastie. Generalized additive models. In Statistical Models in S, pages 249–307. Routledge, 2017.

[11] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.

[12] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

[13] Kurt Hornik. Approximation capabilities of multilayer feedforward networks.
Neural Networks, 4(2):251–257, 1991.

[14] Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, pages 1876–1887, 2017.

[15] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[18] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.

[19] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–158. ACM, 2012.

[20] Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 623–631. ACM, 2013.

[21] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through L0 regularization. International Conference on Learning Representations, 2018.

[22] Francisco Louzada, Anderson Ara, and Guilherme B Fernandes. Classification methods applied to credit scoring: Systematic review and overall comparison.
Surveys in Operations Research and Management Science, 21(2):117–134, 2016.

[23] Julian D Olden and Donald A Jackson. Illuminating the "black box": a randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling, 154(1-2):135–150, 2002.

[24] R Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.

[25] William JE Potts. Generalized additive neural networks. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 194–200. ACM, 1999.

[26] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[27] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.

[28] Daria Sorokina, Rich Caruana, Mirek Riedewald, and Daniel Fink. Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th International Conference on Machine Learning, pages 1000–1007. ACM, 2008.

[29] Sarah Tan, Rich Caruana, Giles Hooker, and Albert Gordo. Transparent model distillation. arXiv preprint arXiv:1801.08640, 2018.

[30] Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou. Detecting bias in black-box models using transparent model distillation. AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, 2017.

[31] Michael Tsang, Dehua Cheng, and Yan Liu. Detecting statistical interactions from neural network weights.
International Conference on Learning Representations, 2018.