{"title": "Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 400, "page_last": 406, "abstract": null, "full_text": "Modeling High-Dimensional Discrete Data with \n\nMulti-Layer Neural Networks \n\nYoshua Bengio \n\nDept.IRO \n\nUniversite de Montreal \n\nMontreal, Qc, Canada, H3C 317 \n\nbengioy@iro.umontreal.ca \n\nSamy Bengio * \n\nIDIAP \n\nCP 592, rue du Simplon 4, \n1920 Martigny, Switzerland \n\nbengio@idiap.ch \n\nAbstract \n\nThe curse of dimensionality is severe when modeling high-dimensional \ndiscrete data: the number of possible combinations of the variables ex(cid:173)\nplodes exponentially. In this paper we propose a new architecture for \nmodeling high-dimensional data that requires resources (parameters and \ncomputations) that grow only at most as the square of the number of vari(cid:173)\nables, using a multi-layer neural network to represent the joint distribu(cid:173)\ntion of the variables as the product of conditional distributions. The neu(cid:173)\nral network can be interpreted as a graphical model without hidden ran(cid:173)\ndom variables, but in which the conditional distributions are tied through \nthe hidden units. The connectivity of the neural network can be pruned by \nusing dependency tests between the variables. Experiments on modeling \nthe distribution of several discrete data sets show statistically significant \nimprovements over other methods such as naive Bayes and comparable \nBayesian networks, and show that significant improvements can be ob(cid:173)\ntained by pruning the network. \n\n1 Introduction \nThe curse of dimensionality hits particularly hard on models of high-dimensional discrete \ndata because there are many more possible combinations of the values of the variables than \ncan possibly be observed in any data set, even the large data sets now common in data(cid:173)\nmining applications. 
In this paper we are dealing in particular with multivariate discrete data, where one tries to build a model of the distribution of the data. This can be used for example to detect anomalous cases in data-mining applications, or it can be used to model the class-conditional distribution of some observed variables in order to build a classifier. A simple multinomial maximum likelihood model would give zero probability to all of the combinations not encountered in the training set, i.e., it would most likely give zero probability to most out-of-sample test cases. Smoothing the model by assigning the same non-zero probability to all the unobserved cases would not be satisfactory either, because it would not provide much generalization from the training set. Such generalization could be obtained by using a multivariate multinomial model whose parameters θ are estimated by the maximum a-posteriori (MAP) principle, i.e., those that have the greatest probability given the training data D, using a diffuse prior P(θ) (e.g. Dirichlet) on the parameters.

A graphical model or Bayesian network [6, 5] represents the joint distribution of random variables Z1 ... Zn with

P(Z1 ... Zn) = ∏_{i=1}^{n} P(Zi | Parents_i)

where Parents_i is the set of random variables which are called the parents of variable i in the graphical model because they directly condition Zi, and an arrow is drawn, in the graphical model, to Zi from each of its parents. A fully connected "left-to-right" graphical model is illustrated in Figure 1 (left), which corresponds to the model

P(Z1 ... Zn) = ∏_{i=1}^{n} P(Zi | Z1 ... Zi-1).   (1)

* Part of this work was done while S.B. was at CIRANO, Montreal, Qc, Canada.

Figure 1: Left: a fully connected "left-to-right" graphical model. 
\nRight: the architecture of a neural network that simulates a ful1y connected \"left-to-right\" \ngraphical model. The observed values Zi = Zi are encoded in the corresponding input \nunit group. hi is a group of hidden units. gi is a group of output units, which depend \non Zl ... Zi -l , representing the parameters of a distribution over Zi. These conditional \nprobabilities P(ZiIZl . . . Zi-r) are multiplied to obtain the joint distribution. \n\nNote that this representation depends on the ordering of the variables (in that all previous \nvariables in this order are taken as parents). We call each combination of the values of \nParentsi a context. In the \"exact\" model (with the full table of all possible contexts) all the \norders are equivalent, but if approximations are used, different predictions could be made \nby different models assuming different orders. \n\nIn graphical models, the curse of dimensionality shows up in the representation of condi(cid:173)\ntional distributions P(Zi IParentsi) where Zi has many parents. If Zj E Parentsi can take \nnj values, there are TI j nj different contexts which can occur in which one would like to \nestimate the distribution of Zi. This serious problem has been addressed in the past by two \ntypes of approaches, which are sometimes combined: \n\n1. Not modeling all the dependencies between all the variables: this is the approach mainly \ntaken with most graphical models or Bayes networks [6, 5] . The set of independencies \ncan be assumed using a-priori or human expert knowledge or can be learned from data. \nSee also [2] in which the set Parentsi is restricted to at most one element, which is \nchosen to maximize the correlation with Zi. \n\n2. 
Approximating the mathematical form of the joint distribution with a form that takes into account only dependencies of lower order, or only some of the possible dependencies, e.g., with the Rademacher-Walsh expansion or multi-binomial [1, 3], which is a low-order polynomial approximation of a full joint binomial distribution (and is used in the experiments reported in this paper).

The approach we are putting forward in this paper is mostly of the second category, although we are using simple non-parametric statistics of the dependency between pairs of variables to further reduce the number of required parameters.

In the multi-binomial model [3], the joint distribution of a set of binary variables is approximated by a polynomial. Whereas the "exact" representation of P(Z1 = z1, ... Zn = zn) as a function of z1 ... zn is a polynomial of degree n, it can be approximated with a lower degree polynomial, and this approximation can be easily computed using the Rademacher-Walsh expansion [1] (or other similar expansions, such as the Bahadur-Lazarsfeld expansion [1]). Therefore, instead of having 2^n parameters, the approximated model for P(Z1, ... Zn) only requires O(n^k) parameters. Typically, order k = 2 is used. The model proposed here also requires O(n^2) parameters, but it allows the modeling of dependencies among tuples of more than 2 variables at a time.

In previous related work by Frey [4], a fully-connected graphical model is used (see Figure 1, left) but each of the conditional distributions is represented by a logistic, which takes into account only first-order dependencies between the variables:

P(Zi = 1 | Z1 ... Zi-1) = 1 / (1 + exp(-w0 - ∑_{j<i} wj Zj))
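The chain-rule factorization of Equation (1), with each conditional represented by a Frey-style logistic as above, can be sketched in a few lines of Python. This is a minimal illustration, not the paper's full architecture (it omits the hidden-unit groups hi that tie the conditionals together); the numpy usage and the random, untrained weights W0 and W are our own placeholder choices.

```python
import itertools

import numpy as np


def logistic(x):
    """Standard sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def conditional(z_prev, w0, w):
    """P(Z_i = 1 | z_1 ... z_{i-1}) as a single logistic of the previous
    bits, i.e., only first-order dependencies, as in Frey's model."""
    return logistic(w0 + np.dot(w, z_prev))


def joint_prob(z, W0, W):
    """Joint probability of binary vector z as the product of the n
    logistic conditionals (Equation (1)). W0[i] is the bias for variable
    i; W[i, :i] holds its weights from z_1 ... z_{i-1}."""
    p = 1.0
    for i in range(len(z)):
        p1 = conditional(z[:i], W0[i], W[i, :i])
        p *= p1 if z[i] == 1 else (1.0 - p1)
    return p


# Arbitrary (untrained) parameters for n = 3 binary variables.
rng = np.random.default_rng(0)
n = 3
W0 = rng.normal(size=n)
W = rng.normal(size=(n, n))

# Because each conditional is a proper Bernoulli distribution, the
# chain-rule product is normalized: the 2^n joint probabilities sum to 1.
total = sum(joint_prob(np.array(bits), W0, W)
            for bits in itertools.product([0, 1], repeat=n))
print(round(total, 6))  # 1.0
```

The same telescoping argument holds for any ordering of the variables: as long as every conditional sums to one over the values of Zi, the product defines a valid joint distribution without computing a normalization constant over the 2^n configurations.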