{"title": "Learning Sparse Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 654, "page_last": 660, "abstract": null, "full_text": "Learning Sparse Perceptrons \n\nJeffrey C. Jackson \n\nMathematics & Computer Science Dept. \n\nDuquesne University \n\n600 Forbes Ave \n\nPittsburgh, PA 15282 \n\njackson@mathcs.duq.edu \n\nMark W. Craven \n\nComputer Sciences Dept. \n\nUniversity of Wisconsin-Madison \n\n1210 West Dayton St. \nMadison, WI 53706 \ncraven@cs.wisc.edu \n\nAbstract \n\nWe introduce a new algorithm designed to learn sparse perceptrons over input representations which include high-order features. Our algorithm, which is based on a hypothesis-boosting method, is able to PAC-learn a relatively natural class of target concepts. Moreover, the algorithm appears to work well in practice: on a set of three problem domains, the algorithm produces classifiers that utilize small numbers of features yet exhibit good generalization performance. Perhaps most importantly, our algorithm generates concept descriptions that are easy for humans to understand. \n\n1 Introduction \n\nMulti-layer perceptron (MLP) learning is a powerful method for tasks such as concept classification. However, in many applications, such as those that may involve scientific discovery, it is crucial to be able to explain predictions. Multi-layer perceptrons are limited in this regard, since their representations are notoriously difficult for humans to understand. We present an approach to learning understandable, yet accurate, classifiers. Specifically, our algorithm constructs sparse perceptrons, i.e., single-layer perceptrons that have relatively few non-zero weights. Our algorithm for learning sparse perceptrons is based on a new hypothesis boosting algorithm (Freund & Schapire, 1995). 
Although our algorithm was initially developed from a learning-theoretic point of view and retains certain theoretical guarantees (it PAC-learns the class of sparse perceptrons), it also works well in practice. Our experiments in a number of real-world domains indicate that our algorithm produces perceptrons that are relatively comprehensible, and that exhibit generalization performance comparable to that of backprop-trained MLP's (Rumelhart et al., 1986) and better than decision trees learned using C4.5 (Quinlan, 1993). \n\nWe contend that sparse perceptrons, unlike MLP's, are comprehensible because they have relatively few parameters, and each parameter describes a simple (i.e., linear) relationship. As evidence that sparse perceptrons are comprehensible, consider that such linear functions are commonly used to express domain knowledge in fields such as medicine (Spackman, 1988) and molecular biology (Stormo, 1987). \n\n2 Sparse Perceptrons \n\nA perceptron is a weighted threshold over the set of input features and over higher-order features consisting of functions operating on only a limited number of the input features. Informally, a sparse perceptron is any perceptron that has relatively few non-zero weights. For our later theoretical results we will need a more precise definition of sparseness, which we develop now. Consider a Boolean function f : {0,1}^n -> {-1,+1}. Let C_k be the set of all conjunctions of at most k of the inputs to f. C_k includes the \"conjunction\" of 0 inputs, which we take as the identically 1 function. All of the functions in C_k map to {-1,+1}, and every conjunction in C_k occurs in both a positive sense (+1 represents true) and a negated sense (-1 represents true). 
Then the function f is a k-perceptron if there is some integer s such that f(x) = sign(sum_{i=1}^{s} h_i(x)), where for all i, h_i is in C_k, and sign(y) is undefined if y = 0 and is y/|y| otherwise. Note that while we have not explicitly shown any weights in our definition of a k-perceptron f, integer weights are implicitly present in that we allow a particular h_i in C_k to appear more than once in the sum defining f. In fact, it is often convenient to think of a k-perceptron as a simple linear discriminant function with integer weights defined over a feature space with O(n^k) features, one feature for each element of C_k. \nWe call a given collection of s conjunctions h_i in C_k a k-perceptron representation of the corresponding function f, and we call s the size of the representation. We define the size of a given k-perceptron function f as the minimal size of any k-perceptron representation of f. An s-sparse k-perceptron is a k-perceptron f such that the size of f is at most s. We denote by P_k^n the set of Boolean functions over {0,1}^n which can be represented as k-perceptrons, and we define P_k as the union over n of P_k^n. The subclass of s-sparse k-perceptrons is denoted by P_{k,s}. We are also interested in the class PR_k^r of k-perceptrons with real-valued weights, at most r of which are non-zero. \n\n3 The Learning Algorithm \n\nIn this section we develop our learning algorithm and prove certain performance guarantees. Our algorithm is based on a recent \"hypothesis boosting\" algorithm that we describe after reviewing some basic learning-theory terminology. 
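For concreteness, the k-perceptron definition above can be rendered as a short Python sketch. This is our own illustration, not code from the paper; all names are illustrative. It represents each h_i as a signed conjunction over 0/1 inputs and implements f(x) = sign(sum_i h_i(x)), with repeated conjunctions acting as implicit integer weights:

```python
def conj_feature(indices, sense):
    """A {-1,+1}-valued conjunction of the 0/1 inputs at `indices`.
    sense=+1 gives the conjunction itself; sense=-1 gives its negation.
    Empty `indices` is the identically-true (constant +1) conjunction."""
    def h(x):
        val = 1 if all(x[i] == 1 for i in indices) else -1
        return sense * val
    return h

def k_perceptron(conjunctions):
    """f(x) = sign(sum_i h_i(x)); listing a conjunction twice acts as weight 2."""
    def f(x):
        total = sum(h(x) for h in conjunctions)
        return 1 if total > 0 else (-1 if total < 0 else 0)  # sign undefined at 0
    return f

# A 3-sparse 2-perceptron over n = 3 Boolean inputs:
f = k_perceptron([
    conj_feature((0, 1), +1),  # x0 AND x1
    conj_feature((0, 1), +1),  # repeated conjunction: implicit integer weight 2
    conj_feature((2,), -1),    # NOT x2
])
```

Viewed as a linear discriminant, this example has weight 2 on the conjunction x0 AND x1 and weight 1 on the negated feature NOT x2.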
\n\n3.1 PAC Learning and Hypothesis Boosting \n\nFollowing Valiant (1984), we say that a function class F (such as P_k for fixed k) is (strongly) PAC-learnable if there is an algorithm A and a polynomial function p_1 such that for any positive epsilon and delta, any f in F (the target function), and any probability distribution D over the domain of f, with probability at least 1 - delta, algorithm A(EX(f,D), epsilon, delta) produces a function h (the hypothesis) such that Pr[Pr_D[f(x) != h(x)] > epsilon] < delta. The outermost probability is over the random choices made by the EX oracle and any random choices made by A. Here EX(f,D) denotes an oracle that, when queried, chooses a vector of input values x with probability D and returns the pair (x, f(x)) to A. The learning algorithm A must run in time p_1(n, s, epsilon^{-1}, delta^{-1}), where n is the length of the input vector to f and s is the size of f; the algorithm is charged one unit of time for each call to EX. We sometimes call the function h output by A an epsilon-approximator (or strong approximator) to f with respect to D. If F is PAC-learnable by an algorithm A that outputs only hypotheses in class H then we say that F is PAC-learnable by H. \n\nAdaBoost \nInput: training set S of m examples of function f; weak learning algorithm WL that is (1/2 - gamma)-approximate; gamma \nAlgorithm: \n1. T <- ln(m) / (2 gamma^2) \n2. for all x in S, w(x) <- 1/m \n3. for i = 1 to T do \n4.   for all x in S, D_i(x) <- w(x) / sum_{x in S} w(x) \n5.   invoke WL on S and distribution D_i, producing weak hypothesis h_i \n6.   eps_i <- sum_{x : h_i(x) != f(x)} D_i(x) \n7.   beta_i <- eps_i / (1 - eps_i) \n8.   for all x in S, if h_i(x) = f(x) then w(x) <- w(x) * beta_i \n9. enddo \nOutput: h(x) = sign(sum_{i=1}^{T} -ln(beta_i) * h_i(x)) \n\nFigure 1: The AdaBoost algorithm. 
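The pseudocode of Figure 1 can be rendered as a minimal executable sketch. This is our own rendering, not the authors' code: the number of stages T is passed in rather than computed from gamma, and the early exits for degenerate weak-hypothesis error rates are implementation conveniences not in the figure:

```python
import math

def adaboost(S, labels, weak_learn, T):
    """AdaBoost as in Figure 1. S is a list of examples, labels[i] in {-1,+1},
    and weak_learn(S, labels, D) returns a {-1,+1}-valued weak hypothesis."""
    m = len(S)
    w = [1.0 / m] * m                       # step 2: initial example weights
    hyps, votes = [], []
    for _ in range(T):                      # steps 3-9
        total = sum(w)
        D = [wi / total for wi in w]        # step 4: normalized distribution D_i
        h = weak_learn(S, labels, D)        # step 5
        eps = sum(D[j] for j in range(m) if h(S[j]) != labels[j])  # step 6
        if eps == 0:                        # perfect weak hypothesis: use it alone
            hyps, votes = [h], [1.0]
            break
        if eps >= 0.5:                      # weak learner failed; stop boosting
            break
        beta = eps / (1 - eps)              # step 7
        hyps.append(h)
        votes.append(-math.log(beta))       # final vote weight is -ln(beta_i)
        for j in range(m):                  # step 8: down-weight correct examples
            if h(S[j]) == labels[j]:
                w[j] *= beta
    def H(x):                               # output: weighted threshold of the h_i
        s = sum(v * h(x) for v, h in zip(votes, hyps))
        return 1 if s >= 0 else -1
    return H
```

Any weak learner with error bounded below 1/2 on every supplied distribution can be plugged in; the exhaustive conjunction search of Section 3.2 is one such choice.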
If F is PAC-learnable for epsilon = 1/2 - 1/p_2(n,s), where p_2 is a polynomial function, then F is weakly PAC-learnable, and the output hypothesis h in this case is called a weak approximator. \nOur algorithm for finding sparse perceptrons is, as indicated earlier, based on the notion of hypothesis boosting. The specific boosting algorithm we use (Figure 1) is a version of the recent AdaBoost algorithm (Freund & Schapire, 1995). In the next section we apply AdaBoost to \"boost\" a weak learning algorithm for P_{k,s} into a strong learner for P_{k,s}. AdaBoost is given a set S of m examples of a function f : {0,1}^n -> {-1,+1} and a weak learning algorithm WL which takes epsilon = 1/2 - gamma for a given gamma (gamma must be bounded by an inverse polynomial in n and s). AdaBoost runs for T = ln(m)/(2 gamma^2) stages. At each stage it creates a probability distribution D_i over the training set and invokes WL to find a weak hypothesis h_i with respect to D_i (note that an example oracle EX(f, D_i) can be simulated given D_i and S). At the end of the T stages a final hypothesis h is output; this is just a weighted threshold over the weak hypotheses {h_i | 1 <= i <= T}. If the weak learner succeeds in producing a (1/2 - gamma)-approximator at each stage then AdaBoost's final hypothesis is guaranteed to be consistent with the training set (Freund & Schapire, 1995). \n\n3.2 PAC-Learning Sparse k-Perceptrons \n\nWe now show that sparse k-perceptrons are PAC learnable by real-weighted k-perceptrons having relatively few nonzero weights. Specifically, ignoring log factors, P_{k,s} is learnable by PR_k^{O(s^2)} for any constant k. We first show that, given a training set for any f in P_{k,s}, we can efficiently find a consistent h in PR_k^{O(s^2 log m)}. This consistency algorithm is the basis of the algorithm we later apply to empirical learning problems. We then show how to turn the consistency algorithm into a PAC learning algorithm. 
Our proof is implicit in somewhat more general work by Freund (1993), although he did not actually present a learning algorithm for this class or analyze the sample size needed to ensure epsilon-approximation, as we do. Following Freund, we begin our development with the following lemma (Goldmann et al., 1992): \n\nLemma 1 (Goldmann, Hastad, Razborov) For f : {0,1}^n -> {-1,+1} and H any set of functions with the same domain and range, if f can be represented as f(x) = sign(sum_{i=1}^{s} h_i(x)), where h_i in H, then for any probability distribution D over {0,1}^n there is some h_i such that Pr_D[f(x) != h_i(x)] <= 1/2 - 1/(2s). \n\nIf we specialize this lemma by taking H = C_k (recall that C_k is the set of conjunctions of at most k input features of f) then this implies that for any f in P_{k,s} and any probability distribution D over the input features of f there is some h_i in C_k that weakly approximates f with respect to D. Therefore, given a training set S and distribution D that has nonzero weight only on instances in S, the following simple algorithm is a weak learning algorithm for P_k: exhaustively test each of the O(n^k) possible conjunctions of at most k features until we find a conjunction that (1/2 - 1/(2s))-approximates f with respect to D (we can efficiently compute the approximation of a conjunction h_i by summing the values of D over those inputs where h_i and f agree). Any such conjunction can be returned as the weak hypothesis. The above lemma proves that if f is a k-perceptron then this exhaustive search must succeed at finding such a hypothesis. Therefore, given a training set of m examples of any s-sparse k-perceptron f, AdaBoost run with the above weak learner will, after 2s^2 ln(m) stages, produce a hypothesis consistent with the training set. 
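The exhaustive weak learner just described can be sketched as follows. This is our own illustration: for simplicity it returns the conjunction with the smallest D-weighted error, which is at least as good as stopping at the first adequate one; all names are ours:

```python
from itertools import combinations

def weak_learn_conjunctions(S, labels, D, k):
    """Exhaustively test every conjunction of at most k of the n Boolean
    inputs, in both senses, and return the one with the smallest D-weighted
    error. Examples x are 0/1 tuples; labels are in {-1,+1}."""
    n, m = len(S[0]), len(S)
    best_h, best_err = None, 2.0
    for size in range(k + 1):               # size 0 = the identically-true conjunction
        for idx in combinations(range(n), size):
            for sense in (1, -1):
                def h(x, idx=idx, sense=sense):
                    return sense * (1 if all(x[i] == 1 for i in idx) else -1)
                err = sum(D[j] for j in range(m) if h(S[j]) != labels[j])
                if err < best_err:
                    best_err, best_h = err, h
    return best_h
```

By Lemma 1, whenever the target is an s-sparse k-perceptron the returned conjunction has D-weighted error at most 1/2 - 1/(2s), so it can serve as the weak hypothesis at every boosting stage.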
Because each stage adds one weak hypothesis to the output hypothesis, the final hypothesis will be a real-weighted k-perceptron with at most 2s^2 ln(m) nonzero weights. \nWe can convert this consistency algorithm to a PAC learning algorithm as follows. First, given a finite set of functions F, it is straightforward to show the following (see, e.g., Haussler, 1988): \n\nLemma 2 Let F be a finite set of functions over a domain X. For any function f over X, any probability distribution D over X, and any positive epsilon and delta, given a set S of m examples drawn consecutively from EX(f,D), where m >= epsilon^{-1} (ln delta^{-1} + ln |F|), then Pr[there exists h in F such that for all x in S f(x) = h(x) and Pr_D[f(x) != h(x)] > epsilon] < delta, where the outer probability is over the random choices made by EX(f,D). \n\nThe consistency algorithm above finds a consistent hypothesis in PR_k^r, where r = 2s^2 ln(m). Also, based on a result of Bruck (1990), it can be shown that ln |PR_k^r| = O(r^2 + k r log n). Therefore, ignoring log factors, a randomly-generated training set of size O(k s^4 / epsilon) is sufficient to guarantee that, with high probability, our algorithm will produce an epsilon-approximator for any s-sparse k-perceptron target. In other words, the following is a PAC algorithm for P_{k,s}: compute a sufficiently large (but polynomial in the PAC parameters) m, draw m examples from EX(f,D) to create a training set, and run the consistency algorithm on this training set. \n\nSo far we have shown that sparse k-perceptrons are learnable by sparse perceptron hypotheses (with potentially polynomially-many more weights). In practice, of course, we expect that many real-world classification tasks cannot be performed exactly by sparse perceptrons. In fact, it can be shown that for certain (reasonable) definitions of \"noisy\" sparse perceptrons (loosely, functions that are approximated reasonably well by sparse perceptrons), the class of noisy sparse k-perceptrons is still PAC-learnable. 
This claim is based on results of Aslam and Decatur (1993), who present a noise-tolerant boosting algorithm. In fact, several different boosting algorithms could be used to learn P_{k,s} (e.g., Freund, 1993). We have chosen to use AdaBoost because it seems to offer significant practical advantages, particularly in terms of efficiency. Also, our empirical results to date indicate that our algorithm works very well on difficult (presumably \"noisy\") real-world problems. However, one potential advantage of basing the algorithm on one of these earlier boosters instead of AdaBoost is that the algorithm would then produce a perceptron with integer weights while still maintaining the sparseness guarantee of the AdaBoost-based algorithm. \n\n3.3 Practical Considerations \n\nWe turn now to the practical details of our algorithm, which is based on the consistency algorithm above. First, it should be noted that the theory developed above works over discrete input domains (Boolean or nominal-valued features). Thus, in this paper, we consider only tasks with discrete input features. Also, because the algorithm uses exhaustive search over all conjunctions of size k, learning time depends exponentially on the choice of k. In this study we use k = 2 throughout, since this choice results in reasonable learning times. \nAnother implementation concern involves deciding when the learning algorithm should terminate. The consistency algorithm uses the size of the target function in calculating the number of boosting stages. Of course, such size information is not available in real-world applications, and in fact, the target function may not be exactly representable as a sparse perceptron. In practice, we use cross validation to determine an appropriate termination point. 
To facilitate comprehensibility, we also limit the number of boosting stages to at most the number of weights that would occur in an ordinary perceptron for the task. For similar reasons, we also modify the criteria used to select the weak hypothesis at each stage so that simple features are preferred over conjunctive features. In particular, given distribution D at some stage j, for each h_i in C_k we compute a correlation E_D[f * h_i]. We then multiply each high-order feature's correlation by a constant discount factor. The h_i with the largest resulting correlation serves as the weak hypothesis for stage j. \n\n4 Empirical Evaluation \n\nIn our experiments, we are interested in assessing both the generalization ability and the complexity of the hypotheses produced by our algorithm. We compare our algorithm to ordinary perceptrons trained using backpropagation (Rumelhart et al., 1986), multi-layer perceptrons trained using backpropagation, and decision trees induced using the C4.5 system (Quinlan, 1993). We use C4.5 in our experiments as a representative of \"symbolic\" learning algorithms. Symbolic algorithms are widely believed to learn hypotheses that are more comprehensible than neural networks. Additionally, to test the hypothesis that the performance of our algorithm can be explained solely by its use of second-order features, we train ordinary perceptrons using feature sets that include all pairwise conjunctions, as well as the ordinary features. To test the hypothesis that the performance of our algorithm can be explained by its use of relatively few weights, we consider ordinary perceptrons which have been pruned using a variant of the Optimal Brain Damage (OBD) algorithm (Le Cun et al., 1989). In our version of OBD, we train a perceptron until the stopping criteria are met, prune the weight with the smallest salience, and then iterate the process. We use a validation set to decide when to stop pruning weights. 
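The stage-wise selection rule of Section 3.3 can be sketched as follows. This is our own illustration: the 0.75 discount is an assumed placeholder (the paper's exact factor is not reproduced here), and all names are ours:

```python
def pick_weak_hypothesis(features, S, labels, D, discount=0.75):
    """Select the weak hypothesis for one boosting stage by the correlation
    E_D[f * h], discounting high-order (conjunctive) features so that simple
    features are preferred. `features` is a list of (h, order) pairs, where
    h maps an example to {-1,+1} and order is the number of inputs it uses.
    NOTE: the 0.75 discount is an assumption, not the paper's stated value."""
    best_h, best_score = None, float("-inf")
    for h, order in features:
        corr = sum(D[j] * labels[j] * h(S[j]) for j in range(len(S)))
        score = corr * (discount if order > 1 else 1.0)  # penalize conjunctions
        if score > best_score:
            best_score, best_h = score, h
    return best_h
```

With this rule, a second-order feature is chosen only when its correlation exceeds that of the best simple feature by enough to overcome the discount.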
\nFor each training set, we use cross-validation to select the number of hidden units (5, 10, 20, 40 or 80) for the MLP's, and the pruning confidence level for the C4.5 trees. We use a validation set to decide when to stop training for the MLP's. \nWe evaluate our algorithm using three real-world domains: the voting data set from the UC-Irvine database; a promoter data set which is a more complex superset of the UC-Irvine one; and a data set in which the task is to recognize protein-coding regions in DNA (Craven & Shavlik, 1993). We remove the physician-fee-freeze feature from the voting data set to make the problem more difficult. We conduct our experiments using a 10-fold cross-validation methodology, except in the protein-coding domain. Because of certain domain-specific characteristics of this data set, we use 4-fold cross-validation for our experiments with it. \n\nTable 1: Test-set accuracy. \n\ndomain   | boosting | multi-layer | ordinary | 2nd-order | pruned  | C4.5 \nvoting   | 91.5%    | 92.2%       | 90.8%    | 89.2% *   | 87.6% * | 89.2% * \npromoter | 92.7     | 90.6        | 90.0 *   | 88.2 *    | 88.7 *  | 84.4 * \ncoding   | 72.9     | 71.6        | 70.7 *   | 70.3 *    | 69.8 *  | 62.6 * \n\nTable 2: Hypothesis complexity (# weights). \n\ndomain         | boosting | multi-layer | ordinary | 2nd-order | pruned \nvoting         | 12       | 450         | 30       | 651       | 12 \npromoters      | 59       | 25764       | 228      | 2267      | 41 \nprotein coding | 37       | 1740        | 60       | 4270      | 52 \n\nTable 1 reports test-set accuracy for each method on all three domains. We measure the statistical significance of accuracy differences using a paired, two-tailed t-test. The symbol '*' marks results in cases where another algorithm is less accurate than our boosting algorithm at the p <= 0.05 level of significance. No other algorithm is significantly better than our boosting method in any of the domains. 
\nFrom these results we conclude that (1) our algorithm exhibits good generalization performance on a number of interesting real-world problems, and (2) the generalization performance of our algorithm is not explained solely by its use of second-order features, nor is it solely explained by the sparseness of the perceptrons it produces. An interesting open question is whether perceptrons trained with both pruning and second-order features are able to match the accuracy of our algorithm; we plan to investigate this question in future work. \nTable 2 reports the average number of weights for all of the perceptrons. For all three problems, our algorithm produces perceptrons with fewer weights than the MLP's, the ordinary perceptrons, and the perceptrons with second-order features. The sizes of the OBD-pruned perceptrons and those produced by our algorithm are comparable for all three domains. Recall, however, that for all three tasks, the perceptrons learned by our algorithm had significantly better generalization performance than their similar-sized OBD-pruned counterparts. We contend that the sizes of the perceptrons produced by our algorithm are within the bounds of what humans can readily understand. In the biological literature, for example, linear discriminant functions are frequently used to communicate domain knowledge about sequences of interest. These functions frequently involve more weights than the perceptrons produced by our algorithm. We conclude, therefore, that our algorithm produces hypotheses that are not only accurate, but also comprehensible. \nWe believe that the results on the protein-coding domain are especially interesting. The input representation for this problem consists of 15 nominal features representing 15 consecutive bases in a DNA sequence. In the regions of DNA that encode proteins (the positive examples in our task), non-overlapping triplets of consecutive bases represent meaningful \"words\" called codons. In previous work (Craven & Shavlik, 1993), it has been found that a feature set that explicitly represents codons results in better generalization than a representation of just bases. However, we used the bases representation in our experiments in order to investigate the ability of our algorithm to select the \"right\" second-order features. Interestingly, nearly all of the second-order features included in our sparse perceptrons represent conjunctions of bases that are in the same codon. This result suggests that our algorithm is especially good at selecting relevant features from large feature sets. \n\n5 Future Work \n\nOur present algorithm has a number of limitations which we plan to address. Two areas of current research are generalizing the algorithm for application to problems with real-valued features and developing methods for automatically suggesting high-order features to be included in our algorithm's feature set. \n\nAcknowledgements \n\nMark Craven was partially supported by ONR grant N00014-93-1-0998. Jeff Jackson was partially supported by NSF grant CCR-9119319. \n\nReferences \nAslam, J. A. & Decatur, S. E. (1993). General bounds on statistical query learning and PAC learning with noise via hypothesis boosting. In Proc. of the 34th Annual Symposium on Foundations of Computer Science, (pp. 282-291). \nBruck, J. (1990). Harmonic analysis of polynomial threshold functions. SIAM Journal of Discrete Mathematics, 3(2):168-177. \nCraven, M. W. & Shavlik, J. W. (1993). Learning to represent codons: A challenge problem for constructive induction. In Proc. of the 13th International Joint Conf. on Artificial Intelligence, (pp. 1319-1324), Chambery, France. \nFreund, Y. (1993). Data Filtering and Distribution Modeling Algorithms for Machine Learning. PhD thesis, University of California at Santa Cruz. \nFreund, Y. 
& Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In Proc. of the 2nd Annual European Conf. on Computational Learning Theory. \nGoldmann, M., Hastad, J., & Razborov, A. (1992). Majority gates vs. general weighted threshold gates. In Proc. of the 7th IEEE Conf. on Structure in Complexity Theory. \nHaussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, (pp. 177-221). \nLe Cun, Y., Denker, J. S., & Solla, S. A. (1989). Optimal brain damage. In Touretzky, D., editor, Advances in Neural Information Processing Systems (volume 2). \nQuinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann. \nRumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In Rumelhart, D. & McClelland, J., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. MIT Press. \nSpackman, K. A. (1988). Learning categorical decision criteria. In Proc. of the 5th International Conf. on Machine Learning, (pp. 36-46), Ann Arbor, MI. \nStormo, G. (1987). Identifying coding sequences. In Bishop, M. J. & Rawlings, C. J., editors, Nucleic Acid and Protein Sequence Analysis: A Practical Approach. IRL Press. \nValiant, L. G. (1984). A theory of the learnable. Comm. of the ACM, 27(11):1134-1142. \n", "award": [], "sourceid": 1076, "authors": [{"given_name": "Jeffrey", "family_name": "Jackson", "institution": null}, {"given_name": "Mark", "family_name": "Craven", "institution": null}]}