{"title": "Probability Estimation from a Database Using a Gibbs Energy Model", "book": "Advances in Neural Information Processing Systems", "page_first": 531, "page_last": 538, "abstract": "", "full_text": "Probability Estimation from a Database \n\nUsing a Gibbs Energy Model \n\nJohn W. Miller \n\nRodney M. Goodman \n\nMicrosoft Research (9/1051) \n\nDept. of Electrical Engineering (116-81) \n\nOne Microsoft Way \nRedmond, WA 98052 \n\nCalifornia Institute of Technology \n\nPasadena, CA 91125 \n\nAbstract \n\nWe present an algorithm for creating a neural network which pro(cid:173)\nduces accurate probability estimates as outputs. The network im(cid:173)\nplements a Gibbs probability distribution model of the training \ndatabase. This model is created by a new transformation relating \nthe joint probabilities of attributes in the database to the weights \n(Gibbs potentials) of the distributed network model. The theory \nof this transformation is presented together with experimental re(cid:173)\nsults. One advantage of this approach is the network weights are \nprescribed without iterative gradient descent. Used as a classifier \nthe network tied or outperformed published results on a variety of \ndatabases. \n\n1 \n\nINTRODUCTION \n\nThis paper addresses the problem of modeling a discrete database. The database \nis viewed as a collection of independent samples from a probability distribution. \nThis distribution is called the underlying distribution. In contrast, the empirical \ndistribution is the distribution obtained if you take independent random samples \nfrom the database (with replacement). The task of creating a probability model \ncan be separated into two parts. The first part is the problem of choosing statistics \nof the samples which are expected to accurately represent the underlying distribu(cid:173)\ntion. The second part is the problem of choosing a model which is consistent with \nthese statistics. 
Under reasonable assumptions, the optimal solution to the second problem is the method of Maximum Entropy. For a broad class of statistics, the Maximum Entropy solution is a Gibbs probability distribution (Slepian, 1972). In this paper, the background and theoretical result of a transformation from joint statistics to a Gibbs energy (or network weight) representation is presented. We then outline the experimental test results of an efficient algorithm implementing this transform without using gradient descent iteration. \n\n2 BACKGROUND \n\nDefine a set T to be the set of attributes (or fields) in a database. For a particular entry (or record) of the database, define the associated set of attribute values to be the configuration w of the attributes. The set of attribute values associated with a subset b ⊆ T is called a subconfiguration w_b. Using this set notation the Gibbs probability distribution may be defined: \n\nP(w) = Z^{-1} · e^{V_T(w)}    (1) \n\nwhere \n\nV_A(w) = Σ_{b⊆A} J_b(w)    (2) \n\nThe function V is called the energy. The function J_b, called the potential function, defines a real value for every subconfiguration of the set b. Z is the normalizing constant that makes the sum of probabilities of all configurations equal to unity. \n\nPrior work in the neural network literature using the Gibbs distribution (such as the Boltzmann Machine) has primarily used second order models (J_b = 0 if |b| > 2) (Hinton, 1986). By adding new attributes not in the original database, second order potentials have been used to model complex distributions. The work presented in this paper, in contrast, uses higher order potentials to model complex probability distributions. We begin by considering the case where every potential of every order is used to model the distribution. 
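To make equations (1) and (2) concrete, the sketch below (an illustration of ours, not the paper's implementation; all names are invented) enumerates every configuration of a small set of binary attributes, sums the potentials J_b to obtain the energy, and normalizes by Z:

```python
import math
from itertools import chain, combinations, product

def subsets(attrs):
    """All subsets b of a tuple of attribute indices, including the empty set."""
    return chain.from_iterable(combinations(attrs, r) for r in range(len(attrs) + 1))

def gibbs_distribution(n_attrs, J):
    """P(w) = Z^-1 * exp(V_T(w)), with V_T(w) the sum of J_b over all b in T.

    J maps (subset, subconfiguration) pairs to potential values; missing
    entries are treated as zero potentials.  Attributes are binary here
    only to keep the exhaustive enumeration small.
    """
    T = tuple(range(n_attrs))
    energies = {}
    for w in product([0, 1], repeat=n_attrs):
        V = sum(J.get((b, tuple(w[i] for i in b)), 0.0) for b in subsets(T))
        energies[w] = math.exp(V)
    Z = sum(energies.values())  # normalizing constant
    return {w: e / Z for w, e in energies.items()}

# A single second order potential rewarding agreement of attributes 0 and 1.
J = {((0, 1), (0, 0)): 1.0, ((0, 1), (1, 1)): 1.0}
P = gibbs_distribution(2, J)
```

With every potential zero the distribution is uniform; each added potential tilts probability toward the subconfigurations it rewards, here the two agreeing configurations.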
\n\nThe Principle of Inclusion-Exclusion from set theory states that the following two equations are equivalent: \n\ng(A) = Σ_{b⊆A} f(b)    (3) \n\nf(A) = Σ_{b⊆A} (−1)^{|A−b|} g(b)    (4) \n\nThe method of inverting an equation from the form of (3) into one in the form of (4) is a special case of Möbius Inversion. Clifford-Hammersley (Kindermann, 1980) used this relation to invert formula (2): \n\nJ_A(w) = Σ_{b⊆A} (−1)^{|A−b|} V_b(w)    (5) \n\nDefine the probability of a subconfiguration p(w_b) to be the probability that the attributes in set b take on the values defined in the configuration w. Using (1) to describe the probability distribution of subconfigurations, equation (5) can be written: \n\nJ_A(w) = Σ_{b⊆A} (−1)^{|A−b|} ln(p(w_b))    (6) \n\n3 A TRANSFORMATION TO GIBBS POTENTIALS \n\nEquation (6) provides a technique for modeling distributions by potential functions rather than directly through the observable joint statistics of sets of attributes. If the model is truncated by setting high order potentials to zero, then the energy model becomes an estimate of the model obtained by collecting the joint statistics, rather than an exact equivalent. If equation (6) is used directly, the error in the energy due to setting all potentials of order d to zero grows quickly with d. For this reason (6) must be normalized if it is going to be used in a truncated modeling scheme. 
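The inversion in equation (6) can be checked numerically. The sketch below (our illustration; the function names are invented and attributes are identified by position in each record) computes each potential by inclusion-exclusion over empirical marginals, then verifies that with no truncation the potentials sum back to the empirical log-probability of the full configuration:

```python
from itertools import chain, combinations
from math import log, isclose

def subsets(s):
    """All subsets of s, including the empty set and s itself."""
    s = tuple(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def marginal(db, b, w):
    """Empirical p(w_b): fraction of records matching w on the attributes in b."""
    if not b:
        return 1.0
    return sum(all(rec[i] == w[i] for i in b) for rec in db) / len(db)

def potential(db, A, w):
    """J_A(w) = sum over b subset of A of (-1)^|A-b| * ln p(w_b), as in (6)."""
    return sum((-1) ** (len(A) - len(b)) * log(marginal(db, b, w))
               for b in subsets(A))

# A toy database of records over three binary attributes.
db = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (0, 0, 1), (1, 1, 0)]
T = (0, 1, 2)
w = (0, 0, 1)

# With every order kept, Mobius inversion is exact: the potentials of all
# subsets of T telescope back to the empirical log-probability of w.
energy = sum(potential(db, b, w) for b in subsets(T))
assert isclose(energy, log(2 / 5))  # w = (0,0,1) occurs in 2 of 5 records
```

Truncating the sum to low-order subsets is where the normalization discussed next becomes necessary.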
A normalized version of equation (2) that corrects for the unequal number of potentials of different orders is: \n\nV_A(w) = Σ_{b⊆A} C(|A|−1, |b|−1)^{-1} J_b(w)    (7) \n\nThis equation can be inverted to show a surprising result: the weight associated with w_A can be written directly in terms of marginal probabilities: \n\nJ_A(w) = ln(p_A(w)) − (|A|−1)^{-1} Σ_{t∈A} ln(p_{A−t}(w))    (8) \n\nFor example, with three attribute values {x, y, z}, the following potentials are defined: \n\nJ{x} = ln(p(x)) \nJ{y} = ln(p(y)) \nJ{z} = ln(p(z)) \n\nJ{xy} = ln( p(xy) / (p(x)p(y)) ) \nJ{yz} = ln( p(yz) / (p(y)p(z)) ) \nJ{xz} = ln( p(xz) / (p(x)p(z)) ) \n\nJ{xyz} = ln( p(xyz) / √(p(xy)p(yz)p(xz)) ) \n\nFor a given database sample, a potential is activated if all of its defined attribute values are true for the sample. The weighted sum of all activated potentials recovers an approximation of the probability of the database sample. If all potentials of every order have been used to create the model, then this approximation is exactly the probability of the sample in the empirical distribution. The correct weighting is given by equation (7). For example, it is easily verified that: \n\nln(p(xyz)) = J{xyz} + C(2,1)^{-1} (J{xy} + J{xz} + J{yz}) + C(2,0)^{-1} (J{x} + J{y} + J{z}). \n\nThe Gibbs model truncated to second order potentials would estimate the probability in this example by: \n\nln(p(xyz)) ≈ C(2,1)^{-1} (J{xy} + J{xz} + J{yz}) + C(2,0)^{-1} (J{x} + J{y} + J{z}) = ln √(p(xy)p(yz)p(xz)). \n\n4 PROOF OF THE INVERSION FORMULA \n\nTheorem: \nLet T be a finite set. Each element of T will be called an attribute. Each attribute can take on one of a finite set of states called attribute values. A collection of attribute values for every element of T is called a configuration w. 
For all A ⊆ T (including both the empty set A = ∅ and the full set A = T), let V_A(w) and J_A(w) be functions mapping the states of the elements of A to the real numbers. Define C(m, n) = m!/((m−n)!·n!) to be \"m choose n.\" \nLet V_∅(w) = 0, J_∅(w) = 0, and let V_A(w) = J_A(w) if |A| = 1. \nThen for |A| > 1: \n\nV_A(w) = Σ_{b⊆A} C(|A|−1, |b|−1)^{-1} J_b(w)    (9) \n\nand \n\nJ_A(w) = V_A(w) − (|A|−1)^{-1} Σ_{b⊂A, |b|=|A|−1} V_b(w)    (10) \n\nare equivalent in that any assignment of V_A and J_A values for all A ⊆ T will satisfy (9) if and only if they also satisfy (10). \n\nProof: \nLet J be any assignment of the values J_A(w) for all A ⊆ T. Let V be any assignment of all the values V_A(w) for all A ⊆ T. Then clearly (9) maps any assignment J to a unique V. We will represent this mapping by the function f, so (9) is abbreviated V = f(J). Similarly (10) maps any assignment V to a unique J. Equation (10) will be abbreviated J = g(V). The result of Lemma C1 below, applied with the value D set to n, shows that f(g(V)) = V. In Lemma C2 below, it is shown that g(f(J)) = J. Therefore the equations (9) and (10) are inverse one-to-one mappings and the association of assignments between J and V is identical for the two equations. Q.E.D. \n\nLemma C1: \nRather than simply showing f(g(V)) = V, a more general result will be shown. Since the number of potentials of a given order increases exponentially with the order, it is useful to approximate the energy of a configuration by defining a maximum order D such that all potentials of greater order are assumed to be zero: \n\nJ_b(w) = 0 for all b such that |b| > D. \n\nLet Ṽ_A(w) be the resulting approximation to the energy V_A(w). Let |A| = n. \n\nGiven \n\nJ_A(w) = V_A(w) − (n−1)^{-1} Σ_{b⊂A, |b|=n−1} V_b(w)    (11) \n\nand the order D approximation to equation (7): \n\nṼ_A(w) = Σ_{i=1}^{D} C(n−1, i−1)^{-1} Σ_{b⊆A, |b|=i} J_b(w),    (12) \n\nthen 
\n\nṼ_A(w) = C(n−1, D−1)^{-1} Σ_{b⊆A, |b|=D} V_b(w). \n\nNote: \nFor the case D = n, the approximation is exact, \n\nṼ_A(w) = V_A(w), \n\nand so f(g(V)) = V is shown. \n\nThe lemma's result has a simple interpretation. The energy of a configuration is approximated by a scaled average of the energies of the subconfigurations of order D. Using equation (1) to relate energies to probabilities shows that the estimated probability is a scaled geometric mean of the order D marginal probabilities. \n\nProof: \nWe start with the given equation (12) for Ṽ_A(w): \n\nṼ_A(w) = Σ_{i=1}^{D} C(n−1, i−1)^{-1} Σ_{b⊆A, |b|=i} J_b(w). \n\nUse equation (11) to substitute J_b(w) out of the equation: \n\nṼ_A(w) = Σ_{i=1}^{D} C(n−1, i−1)^{-1} Σ_{b⊆A, |b|=i} ( V_b(w) − Σ_{c⊂b, |c|=i−1} (i−1)^{-1} V_c(w) ). \n\nSeparate the term in the first sum where i = D: \n\nṼ_A(w) = C(n−1, D−1)^{-1} Σ_{b⊆A, |b|=D} V_b(w) + Σ_{i=1}^{D−1} C(n−1, i−1)^{-1} Σ_{b⊆A, |b|=i} V_b(w) − Σ_{i=1}^{D} C(n−1, i−1)^{-1} Σ_{b⊆A, |b|=i} Σ_{c⊂b, |c|=i−1} (i−1)^{-1} V_c(w). \n\nCancelling the separated i = D term against the claimed value of Ṽ_A(w), and noting that the summation over c contributes nothing when i = 1 (since V_∅(w) = 0), we see that it is sufficient to show \n\nΣ_{i=1}^{D−1} C(n−1, i−1)^{-1} Σ_{b⊆A, |b|=i} V_b(w) = Σ_{i=2}^{D} C(n−1, i−1)^{-1} Σ_{b⊆A, |b|=i} Σ_{c⊂b, |c|=i−1} (i−1)^{-1} V_c(w). \n\nThe right hand side inner double summation counts a given V_c(w) once for every b such that c ⊂ b ⊆ A with i = |b| = |c| + 1. This occurs exactly |A| − |c| = n − i + 1 times. Thus \n\nΣ_{i=1}^{D−1} C(n−1, i−1)^{-1} Σ_{b⊆A, |b|=i} V_b(w) = Σ_{i=2}^{D} C(n−1, i−1)^{-1} Σ_{c⊆A, |c|=i−1} ((n−i+1)/(i−1)) V_c(w). \n\nNow perform a change of variables. Let j = i − 1 on the right hand side: \n\nΣ_{i=1}^{D−1} C(n−1, i−1)^{-1} Σ_{b⊆A, |b|=i} V_b(w) = Σ_{j=1}^{D−1} C(n−1, j)^{-1} ((n−j)/j) Σ_{c⊆A, |c|=j} V_c(w). \n\nClearly both sides are identical since \n\nC(n−1, i−1)^{-1} = C(n−1, i)^{-1} (n−i)/i. \n\nQ.E.D. \n\nLemma C2: g(f(J)) = J \nLet |A| = n. 
It is sufficient to show that substituting V_b out of (10) using (9) yields an identity: \n\nJ_A(w) = V_A(w) − Σ_{b⊂A, |b|=n−1} (n−1)^{-1} V_b(w) \n= Σ_{b⊆A} C(n−1, |b|−1)^{-1} J_b(w) − Σ_{b⊂A, |b|=n−1} (n−1)^{-1} Σ_{c⊆b} C(|b|−1, |c|−1)^{-1} J_c(w). \n\nSeparate the term in the first sum for which b = A: \n\nJ_A(w) = J_A(w) + Σ_{b⊂A, b≠A} C(n−1, |b|−1)^{-1} J_b(w) − Σ_{b⊂A, |b|=n−1} (n−1)^{-1} Σ_{c⊆b} C(n−2, |c|−1)^{-1} J_c(w). \n\nSubtract J_A(w) from both sides. The right hand side double sum counts a given J_c(w) once for every b such that c ⊆ b ⊂ A with |b| = |A| − 1 = n − 1. This occurs |A| − |c| = n − |c| times. It is sufficient to show \n\nΣ_{b⊂A, b≠A} C(n−1, |b|−1)^{-1} J_b(w) = Σ_{c⊂A, c≠A} ((n−|c|)/(n−1)) C(n−2, |c|−1)^{-1} J_c(w). \n\nBoth sides are identical since \n\nC(n−1, i−1)^{-1} = ((n−i)/(n−1)) C(n−2, i−1)^{-1}. \n\nQ.E.D. \n\n5 USING THE INVERSION FORMULA TO SET NETWORK WEIGHTS \n\nOur method of probability estimation is to first collect empirical frequencies of patterns (subconfigurations) from the database. (An efficient hash table implementation of the algorithm is described in (Miller, 1993). The basic idea is to remove from the database a pattern with low potential whenever there is a hash collision which prevents a new pattern count from being stored.) Second, interpreting these frequencies as probabilities, we convert each pattern frequency to a potential using equation (8). We assume patterns with unknown or uncalculated frequencies have zero potential. Low order patterns which never occur are assigned a large negative potential (this approximation is needed to model events with zero probability in the empirical distribution). 
Finally, we calculate the probability of any new pattern not in the training set using the neural network implementation of equations (7) and (1). \n\n6 RESULTS \n\nOne way to validate the performance of a probability model is to test its performance as a classifier. The probability model is used as a classifier by calculating the probabilities of each unknown class value together with the known attribute values. The most probable combination is then chosen as the predicted class. Used as a classifier, the Gibbs model tied or outperformed published results on a variety of databases. Table 1 outlines results on three datasets taken from the UC Irvine archive (Murphy, 1992). The Gibbs model results were collected from the very first experiment using the algorithm with the datasets. No difficult parameter adjustment is necessary to get the algorithm to classify at these rates. The iris database has 4 real-valued attributes. Each attribute was quantized into a decile ranking for use by the algorithm. \n\n7 CONCLUSION \n\nA new method of extracting a Gibbs probability model from a database has been presented. The approach uses the Principle of Inclusion-Exclusion to invert a set of collected statistics into a set of potentials for a Gibbs energy model. A hash table implementation is used to efficiently process database records in order to collect the most important potentials, or weights, which can be stored in the available memory. Although the model is designed to give accurate probability estimates rather than simply class labels, the model in practice works well as a classifier on a variety of databases. \n\nAcknowledgements \n\nThis work is funded in part by DARPA and ONR under grant N00014-92-J-1860. 
\n\nTable 1: Summary of Classification Results \n\nDatabase       A   C   R    Train  Test  Trials  Gibbs Rate  Compare \nHouse Voting   16  2   435  335    100   50      95.3%       95% \nIris            4  3   150  120     30   100     96.3%       n.a. \nIris            4  3   150  149      1   1000    97.1%       98.0% \nBreast Cancer   9  2   699  599    100   100     97.3%       n.a. \nBreast Cancer   9  2   369  200    169   100     95.7%       93.7% \n\nA = Attribute count in the database, excluding the class attribute \nC = Class count \nR = Record count \nTrain = Number of records used to create the energy for one trial \nTest = Number of records tested in a single trial \nTrials = Number of independent train-test trials used to calculate the rate \nGibbs Rate = Gibbs energy model classification rate \nCompare = Baseline classification result of other methods (Schlimmer, 1987), (Weiss, 1992), (Zhang, 1992) respectively \n\nReferences \n\nD. Slepian, \"On Maxentropic Discrete Stationary Processes,\" Bell System Technical Journal, 51, pp. 629-653, 1972. \n\nG. E. Hinton and T. J. Sejnowski, \"Learning and Relearning in Boltzmann Machines,\" in Parallel Distributed Processing, Vol. I, pp. 282-317, Cambridge, MA: MIT Press, 1986. \n\nR. Kindermann and J. L. Snell, Markov Random Fields and their Applications, Providence, RI: American Mathematical Society, 1980. \n\nJ. W. Miller, \"Building Probabilistic Models from Databases,\" California Institute of Technology, Ph.D. Thesis, 1993. \n\nP. Murphy and D. Aha, UCI Repository of Machine Learning Databases [machine-readable data repository at ics.uci.edu in directory /pub/machine-learning-databases]. Irvine, CA: University of California, Department of Information and Computer Science, 1992. \n\nJ. C. Schlimmer, \"Concept Acquisition Through Representational Adjustment,\" University of California at Irvine, Ph.D. Thesis, 1987. \n\nS. Weiss and I. Kapouleas, \"An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods,\" in Proceedings of the 11th International Joint Conference on Artificial Intelligence, Vol. 1, pp. 781-787, Los Gatos, CA: Morgan Kaufmann, 1992. \n\nJ. Zhang, \"Selecting Typical Instances in Instance-Based Learning,\" in Proceedings of the Ninth International Machine Learning Conference, Aberdeen, Scotland, pp. 470-479, San Mateo, CA: Morgan Kaufmann, 1992. \n", "award": [], "sourceid": 609, "authors": [{"given_name": "John", "family_name": "Miller", "institution": null}, {"given_name": "Rodney", "family_name": "Goodman", "institution": null}]}