{"title": "Data-Dependent Structural Risk Minimization for Perceptron Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 336, "page_last": 342, "abstract": "", "full_text": "On Parallel Versus Serial Processing: \n\nA Computational Study of Visual Search \n\nEyal Cohen \n\nDepartment of Psychology \n\nTel-Aviv University Tel Aviv 69978, Israel \n\neyalc@devil. tau .ac .il \n\nEytan Ruppin \n\nDepartments of Computer Science & Physiology \n\nTel-Aviv University Tel Aviv 69978, Israel \n\nruppin@math.tau .ac.il \n\nAbstract \n\nA novel neural network model of pre-attention processing in visual(cid:173)\nsearch tasks is presented. Using displays of line orientations taken \nfrom Wolfe's experiments [1992], we study the hypothesis that the \ndistinction between parallel versus serial processes arises from the \navailability of global information in the internal representations of \nthe visual scene. The model operates in two phases. First, the \nvisual displays are compressed via principal-component-analysis. \nSecond, the compressed data is processed by a target detector mod(cid:173)\nule in order to identify the existence of a target in the display. Our \nmain finding is that targets in displays which were found exper(cid:173)\nimentally to be processed in parallel can be detected by the sys(cid:173)\ntem, while targets in experimentally-serial displays cannot . This \nfundamental difference is explained via variance analysis of the \ncompressed representations, providing a numerical criterion distin(cid:173)\nguishing parallel from serial displays. Our model yields a mapping \nof response-time slopes that is similar to Duncan and Humphreys's \n\"search surface\" [1989], providing an explicit formulation of their \nintuitive notion of feature similarity. It presents a neural realiza(cid:173)\ntion of the processing that may underlie the classical metaphorical \nexplanations of visual search. 
\n\n\fOn Parallel versus Serial Processing: A Computational Study a/Visual Search \n\n11 \n\n1 \n\nIntroduction \n\nThis paper presents a neural-model of pre-attentive visual processing. The model \nexplains why certain displays can be processed very fast, \"in parallel\" , while others \nrequire slower, \"serial\" processing, in subsequent attentional systems. Our approach \nstems from the observation that the visual environment is overflowing with diverse \ninformation, but the biological information-processing systems analyzing it have \na limited capacity [1]. This apparent mismatch suggests that data compression \nshould be performed at an early stage of perception, and that via an accompa(cid:173)\nnying process of dimension reduction, only a few essential features of the visual \ndisplay should be retained. We propose that only parallel displays incorporate \nglobal features that enable fast target detection, and hence they can be processed \npre-attentively, with all items (target and dis tractors) examined at once. On the \nother hand, in serial displays' representations, global information is obscure and \ntarget detection requires a serial, attentional scan of local features across the dis(cid:173)\nplay. Using principal-component-analysis (peA), our main goal is to demonstrate \nthat neural systems employing compressed, dimensionally reduced representations \nof the visual information can successfully process only parallel displays and not se(cid:173)\nrial ones. The sourCe of this difference will be explained via variance analysis of the \ndisplays' projections on the principal axes. \n\nThe modeling of visual attention in cognitive psychology involves the use of \nmetaphors, e.g., Posner's beam of attention [2]. A visual attention system of a \nsurviving organism must supply fast answers to burning issues such as detecting \na target in the visual field and characterizing its primary features. 
An attentional system employing a constant-speed beam of attention [3] probably cannot perform such tasks fast enough, and a pre-attentive system is required. Treisman's feature integration theory (FIT) describes such a system [4]. According to FIT, features of separate dimensions (shape, color, orientation) are first coded pre-attentively in a locations map and in separate feature maps, each map representing the values of a particular dimension. Then, in the second stage, attention "glues" the features together, conjoining them into objects at their specified locations. This hypothesis was supported using the visual-search paradigm [4], in which subjects are asked to detect a target within an array of distractors, which differ on given physical dimensions such as color, shape or orientation. As long as the target is significantly different from the distractors in one dimension, the reaction time (RT) is short and shows almost no dependence on the number of distractors (low RT slope). This result suggests that in this case the target is detected pre-attentively, in parallel. However, if the target and distractors are similar, or the target specifications are more complex, reaction time grows considerably as a function of the number of distractors [5, 6], suggesting that the displays' items are scanned serially using an attentional process.

FIT and other related cognitive models of visual search are formulated on the conceptual level and do not offer a detailed description of the processes involved in transforming the visual scene from an ordered set of data points into given values in specified feature maps. This paper presents a novel computational explanation of the source of the distinction between parallel and serial processing, progressing from general metaphorical terms to a neural network realization.
Interestingly, we also come out with a computational interpretation of some of these metaphorical terms, such as feature similarity.

2 The Model

We focus our study on visual-search experiments of line orientations performed by Wolfe et al. [7], using three set-sizes composed of 4, 8 and 12 items. The number of items equals the number of distractors + target in target displays, and in non-target displays the target was replaced by another distractor, keeping a constant set-size. Five experimental conditions were simulated: (A) a 20 degrees tilted target among vertical distractors (homogeneous background); (B) a vertical target among 20 degrees tilted distractors (homogeneous background); (C) a vertical target among a heterogeneous background (a mixture of lines with ±20, ±40, ±60, ±80 degrees orientations); (E) a vertical target among two flanking distractor orientations (at ±20 degrees); and (G) a vertical target among two flanking distractor orientations (±40 degrees). The response times (RT) as a function of the set-size measured by Wolfe et al. [7] show that type A, B and G displays are scanned in a parallel manner (1.2, 1.8, 4.8 msec/item for the RT slopes), while type C and E displays are scanned serially (19.7, 17.5 msec/item). The input displays of our system were prepared following Wolfe's prescription: Nine images of the basic line orientations were produced as nine matrices of gray-level values. Displays for the various conditions of Wolfe's experiments were produced by randomly assigning these matrices into a 4x4 array, yielding 128x100 display-matrices that were transformed into 12800 display-vectors. A total number of 2400 displays were produced in 30 groups (80 displays in each group): 5 conditions (A, B, C, E, G) x target/non-target x 3 set-sizes (4, 8, 12).
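For concreteness, display preparation along these lines can be sketched as follows. The tile dimensions, the toy line rasterizer, and all function names are our own assumptions for illustration, not Wolfe's actual stimuli; only the 4x4 grid, the orientation set, and the 12800-dimensional output match the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed tile size: nine 32x25 gray-level "line" tiles on a 4x4 grid give a
# 128x100 display matrix, flattened to a 12800-vector as in the paper.
TILE_H, TILE_W = 32, 25
ORIENTATIONS = [0, 20, -20, 40, -40, 60, -60, 80, -80]  # degrees from vertical

def line_tile(angle_deg):
    """Render one oriented line segment as a gray-level tile (toy rasterizer)."""
    tile = np.zeros((TILE_H, TILE_W))
    t = np.linspace(-1, 1, 200)
    a = np.deg2rad(angle_deg)
    rows = ((np.cos(a) * t * 0.4 + 0.5) * (TILE_H - 1)).astype(int)
    cols = ((np.sin(a) * t * 0.4 + 0.5) * (TILE_W - 1)).astype(int)
    tile[rows, cols] = 1.0
    return tile

def make_display(target_angle, distractor_angles, set_size):
    """Place `set_size` items at random cells of a 4x4 grid; return a 12800-vector."""
    grid = np.zeros((4 * TILE_H, 4 * TILE_W))
    cells = rng.choice(16, size=set_size, replace=False)
    angles = [target_angle] + list(rng.choice(distractor_angles, set_size - 1))
    for cell, angle in zip(cells, angles):
        r, c = divmod(cell, 4)
        grid[r*TILE_H:(r+1)*TILE_H, c*TILE_W:(c+1)*TILE_W] = line_tile(angle)
    return grid.ravel()

# Condition A: 20-degree tilted target among vertical distractors, set-size 8.
v = make_display(target_angle=20, distractor_angles=[0], set_size=8)
assert v.shape == (12800,)
```

Non-target displays would be produced the same way with the target tile replaced by a distractor tile, keeping the set-size constant.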
Our model is composed of two neural network modules connected in sequence as illustrated in Figure 1: a PCA module which compresses the visual data into a set of principal axes, and a Target Detector (TD) module. The latter module uses the compressed data obtained by the former module to detect a target within an array of distractors. The system is presented with line-orientation displays as described above.

Figure 1: General architecture of the model. [The display feeds a PCA data-compression module (input layer of 12800 units, PCA output layer), followed by the target detector (TD) module with an intermediate layer of 12 units and a single output unit coding target = 1, no-target = -1.]

For the PCA module we use the neural network proposed by Sanger, with the connections' values updated in accordance with his Generalized Hebbian Algorithm (GHA) [8]. The outputs of the trained system are the projections of the display-vectors along the first few principal axes, ordered with respect to their eigenvalue magnitudes. Compressing the data is achieved by choosing outputs from the first few neurons (maximal variance and minimal information loss). Target detection in our system is performed by a feed-forward (FF) 3-layered network, trained via a standard back-propagation algorithm in a supervised-learning manner. The input layer of the FF network is composed of the first eight output neurons of the PCA module. The transfer function used in the intermediate and output layers is the hyperbolic tangent function.

3 Results

3.1 Target Detection

The performance of the system was examined in two simulation experiments.
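Sanger's GHA update itself is compact. A minimal sketch follows, with a toy low-dimensional input standing in for the 12800-dimensional display-vectors; the learning rate and data are chosen for illustration only.

```python
import numpy as np

def gha_step(W, x, lr=1e-3):
    """One Generalized Hebbian Algorithm update (Sanger, 1989).

    W : (k, n) weight matrix whose rows converge to the first k principal
        axes of the input distribution, ordered by eigenvalue magnitude.
    x : (n,) zero-mean input vector (a display-vector in the paper).
    """
    y = W @ x                                  # projections onto current axes
    # Sanger's rule: Hebbian term minus a lower-triangular decorrelation term,
    # which forces row i to learn the i-th principal axis.
    W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Toy run on synthetic data with a clearly ordered variance spectrum.
rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
data -= data.mean(axis=0)
W = rng.normal(scale=0.1, size=(2, 5))
for epoch in range(50):
    for x in data:
        gha_step(W, x, lr=1e-3)
# Rows of W should now align with the two dominant principal axes.
```

In the model above the trained rows play the role of the principal axes, and the first eight output units `W @ x` feed the target detector.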
In the first, the PCA module was trained only with "parallel" task displays, and in the second, only with "serial" task displays. There is an inherent difference in the ability of the model to detect targets in parallel versus serial displays. In parallel task conditions (A, B, G) the target detector module learns the task after a comparatively small number (800 to 2000) of epochs, reaching a performance level of almost 100%. However, the target detector module is not capable of learning to detect a target in serial displays (C, E conditions). Interestingly, these results hold (1) whether the preceding PCA module was trained to perform data compression using parallel task displays or serial ones, (2) whether the target detector was a linear simple perceptron, or the more powerful, non-linear network depicted in Figure 1, and (3) whether the full set of 144 principal axes (with non-zero eigenvalues) was used.

3.2 Information Span

To analyze the differences between parallel and serial tasks we examined the eigenvalues obtained from the PCA of the training-set displays. The eigenvalues of condition B (parallel) displays in 4 and 12 set-sizes and of condition C (serial-task) displays are presented in Figure 2. Each training set contains a mixture of target and non-target displays.

Figure 2: Eigenvalues spectrum of displays with different set-sizes, for parallel and serial tasks. [Panels (a) and (b) plot eigenvalue against principal-axis number for the parallel and serial tasks respectively, each for 4-item and 12-item displays.]
Due to the sparseness of the displays (a few black lines on a white background), it takes only 31 principal axes to describe the parallel training-set in full (see Fig. 2a; note that the remaining axes have zero eigenvalues, indicating that they contain no additional information), and 144 axes for the serial set (only the first 50 axes are shown in Fig. 2b).

As evident, the eigenvalue distributions of the two display types are fundamentally different: in the parallel task, most of the eigenvalues' "mass" is concentrated in the first few (15) principal axes, testifying that indeed, the dimension of the parallel displays space is quite confined. But for the serial task, the eigenvalues are distributed almost uniformly over 144 axes. This inherent difference is independent of set-size: 4 and 12-item displays have practically the same eigenvalue spectra.

3.3 Variance Analysis

The target detector inputs are the projections of the display-vectors along the first few principal axes. Thus, some insight into the source of the difference between parallel and serial tasks can be gained by performing a variance analysis on these projections. The five different task conditions were analyzed separately, taking a group of 85 target displays and a group of 85 non-target displays for each set-size. Two types of variances were calculated for the projections on the 5th principal axis: the "within groups" variance, which is a measure of the statistical noise within each group of 85 displays, and the "between groups" variance, which measures the separation between target and non-target groups of displays for each set-size. These variances were averaged for each task (condition), over all set-sizes.
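The dimensionality comparison can be reproduced in outline by counting the non-zero eigenvalues of the training-set covariance. The synthetic low-rank and full-rank data below merely stand in for the parallel and serial display sets; the sizes are illustrative.

```python
import numpy as np

def eigenvalue_spectrum(displays):
    """Eigenvalues of the covariance of a (num_displays, dim) training set,
    sorted in descending order, as used to compare parallel vs. serial sets."""
    X = displays - displays.mean(axis=0)
    cov = X.T @ X / len(X)
    return np.linalg.eigvalsh(cov)[::-1]

# Toy stand-ins (not Wolfe's displays): a rank-3 set vs. a full-rank set.
rng = np.random.default_rng(2)
low_rank = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 40))   # "parallel"-like
full_rank = rng.normal(size=(200, 40))                             # "serial"-like

# Count axes with non-negligible eigenvalues, as in the 31-vs-144 comparison.
n_low = int(np.sum(eigenvalue_spectrum(low_rank) > 1e-8))
n_full = int(np.sum(eigenvalue_spectrum(full_rank) > 1e-8))
assert n_low == 3 and n_full == 40
```

The "information span" of a display set is then just the number of axes above the numerical-zero threshold.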
The resulting ratios Q of within-groups to between-groups standard deviations are: QA = 0.0259, QB = 0.0587, and QG = 0.0114 for parallel displays (A, B, G), and QE = 0.2125, QC = 0.771 for serial ones (E, C).

As evident, for parallel task displays the Q values are smaller by an order of magnitude compared with the serial displays, indicating a better separation between target and non-target displays in parallel tasks. Moreover, using Q as a criterion for the parallel/serial distinction, one can predict that displays with Q << 1 will be processed in parallel, and serially otherwise, in accordance with the experimental response time (RT) slopes measured by Wolfe et al. [7]. These differences are further demonstrated in Figure 3, depicting projections of display-vectors on the sub-space spanned by the 5th, 6th and 7th principal axes. Clearly, for the parallel task (condition B), the PCA representations of the target-displays (plus signs) are separated from non-target representations (circles), while for serial displays (condition C) there is no such separation. It should be emphasized that there is no other principal axis along which such a separation is manifested for serial displays.

Figure 3: Projections of display-vectors on the sub-space spanned by the 5th, 6th and 7th principal axes.
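The Q criterion can be sketched as follows. The paper does not fully specify how the two variances are pooled, so the pooling below is our assumption; the toy inputs just illustrate the separated vs. overlapping regimes.

```python
import numpy as np

def q_criterion(target_proj, nontarget_proj):
    """Ratio Q of within-group to between-group standard deviation of the
    projections on a chosen principal axis (the 5th in the paper).
    Q << 1 predicts parallel search; Q near 1 predicts serial search."""
    within_var = (np.var(target_proj) + np.var(nontarget_proj)) / 2
    grand_mean = np.mean(np.concatenate([target_proj, nontarget_proj]))
    between_var = np.mean([(np.mean(target_proj) - grand_mean) ** 2,
                           (np.mean(nontarget_proj) - grand_mean) ** 2])
    return float(np.sqrt(within_var) / np.sqrt(between_var))

rng = np.random.default_rng(3)
# Well-separated groups (parallel-like) vs. overlapping groups (serial-like).
separated = q_criterion(rng.normal(5.0, 0.1, 85), rng.normal(-5.0, 0.1, 85))
overlapping = q_criterion(rng.normal(0.1, 1.0, 85), rng.normal(-0.1, 1.0, 85))
assert separated < 0.1 < overlapping
```

With the paper's data this quantity, averaged over set-sizes per condition, yields the QA through QC values quoted above.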
Plus signs and circles denote target and non-target display-vectors respectively, (a) for a parallel task (condition B), and (b) for a serial task (condition C). Set-size is 8 items.

While Treisman and her co-workers view the distinction between parallel and serial tasks as a fundamental one, Duncan and Humphreys [5] claim that there is no sharp distinction between them, and that search efficiency varies continuously across tasks and conditions. The determining factors according to Duncan and Humphreys are the similarities between the target and the non-targets (T-N similarities) and the similarities between the non-targets themselves (N-N similarity). Displays with a homogeneous background (high N-N similarity) and a target which is significantly different from the distractors (low T-N similarity) will exhibit parallel, low RT slopes, and vice versa. This claim was illustrated by them using a qualitative "search surface" description as shown in Figure 4a. Based on results from our variance analysis, we can now examine this claim quantitatively: We have constructed a "search surface", using actual numerical data of RT slopes from Wolfe's experiments, replacing the N-N similarity axis by its mathematical manifestation, the within-groups standard deviation, and N-T similarity by the between-groups standard deviation.¹ The resulting surface (Figure 4b) is qualitatively similar to Duncan and Humphreys's. This interesting result testifies that the PCA representation succeeds in producing a viable realization of such intuitive terms as input similarity, and is compatible with the way we perceive the world in visual search tasks.
Figure 4: RT rates versus: (a) Input similarities (the search surface, reprinted from Duncan and Humphreys, 1989). (b) Standard deviations (within and between) of the PCA variance analysis. The asterisks denote Wolfe's experimental data.

4 Summary

In this work we present a two-component neural network model of pre-attentional visual processing. The model has been applied to the visual search paradigm performed by Wolfe et al. Our main finding is that when global-feature compression is applied to visual displays, there is an inherent difference between the representations of serial and parallel-task displays: The neural network studied in this paper has succeeded in detecting a target among distractors only for displays that were experimentally found to be processed in parallel. Based on the outcome of the variance analysis performed on the PCA representations of the visual displays, we present a quantitative criterion enabling one to distinguish between serial and parallel displays. Furthermore, the resulting "search surface" generated by the PCA components is in close correspondence with the metaphorical description of Duncan and Humphreys.

¹ In general, each principal axis contains information from different features, which may mask the information concerning the existence of a target. Hence, the first principal axis may not be the best choice for a discrimination task. In our simulations, the 5th axis, for example, was primarily dedicated to target information, and was hence used for the variance analysis (obviously, the neural network uses information from all the first eight principal axes).

The network demonstrates an interesting generalization ability: Naturally, it can learn to detect a target in parallel displays from examples of such displays. However, it can also learn to perform this task from examples of serial displays only!
On the other hand, we find that it is impossible to learn serial tasks, irrespective of the combination of parallel and serial displays that are presented to the network during the training phase. This generalization ability is manifested not only during the learning phase, but also during the performance phase; displays belonging to the same task have a similar eigenvalue spectrum, irrespective of the actual set-size of the displays, and this result holds true for parallel as well as for serial displays.

The role of PCA in perception was previously investigated by Cottrell [9], designing a neural network which performed tasks such as face identification and gender discrimination. One might argue that PCA, being a global component analysis, is not compatible with the existence of local feature detectors (e.g. orientation detectors) in the cortex. Our work is in line with recent proposals [10] that there exist two pathways for sensory input processing: a fast sub-cortical pathway that contains limited information, and a slow cortical pathway which is capable of providing richer representations of the stimuli. Given this assumption, this paper has presented the first neural realization of the processing that may underlie the classical metaphorical explanations involved in visual search.

References

[1] J. K. Tsotsos. Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13:423-469, 1990.

[2] M. I. Posner, C. R. Snyder, and B. J. Davidson. Attention and the detection of signals. Journal of Experimental Psychology: General, 109:160-174, 1980.

[3] Y. Tsal. Movement of attention across the visual field. Journal of Experimental Psychology: Human Perception and Performance, 9:523-530, 1983.

[4] A. Treisman and G. Gelade. A feature integration theory of attention. Cognitive Psychology, 12:97-136, 1980.

[5] J. Duncan and G. Humphreys. Visual search and stimulus similarity.
Psychological Review, 96:433-458, 1989.

[6] A. Treisman and S. Gormican. Feature analysis in early vision: Evidence from search asymmetries. Psychological Review, 95:15-48, 1988.

[7] J. M. Wolfe, S. R. Friedman-Hill, M. I. Stewart, and K. M. O'Connell. The role of categorization in visual search for orientation. Journal of Experimental Psychology: Human Perception and Performance, 18:34-49, 1992.

[8] T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459-473, 1989.

[9] G. W. Cottrell. Extracting features from faces using compression networks: Face, identity, emotion and gender recognition using holons. Proceedings of the 1990 Connectionist Models Summer School, pages 328-337, 1990.

[10] J. L. Armony, D. Servan-Schreiber, J. D. Cohen, and J. E. LeDoux. Computational modeling of emotion: exploration through the anatomy and physiology of fear conditioning. Trends in Cognitive Sciences, 1(1):28-34, 1997.

Data-Dependent Structural Risk Minimisation for Perceptron Decision Trees

John Shawe-Taylor
Dept of Computer Science
Royal Holloway, University of London
Egham, Surrey TW20 0EX, UK
Email: jst@dcs.rhbnc.ac.uk

Nello Cristianini
Dept of Engineering Mathematics
University of Bristol
Bristol BS8 1TR, UK
Email: nello.cristianini@bristol.ac.uk

Abstract

Perceptron Decision Trees (also known as Linear Machine DTs, etc.) are analysed in order that data-dependent Structural Risk Minimisation can be applied. Data-dependent analysis is performed which indicates that choosing the maximal margin hyperplanes at the decision nodes will improve the generalization. The analysis uses a novel technique to bound the generalization error in terms of the margins at individual nodes. Experiments performed on real data sets confirm the validity of the approach.
1 Introduction

Neural network researchers have traditionally tackled classification problems by assembling perceptron or sigmoid nodes into feedforward neural networks. In this paper we consider a less common approach where the perceptrons are used as decision nodes in a decision tree structure. The approach has the advantage that more efficient heuristic algorithms exist for these structures, while the advantages of inherent parallelism are if anything greater, as all the perceptrons can be evaluated in parallel, with the path through the tree determined in a very fast post-processing phase.

Classical Decision Trees (DTs), like the ones produced by popular packages such as CART [5] or C4.5 [9], partition the input space by means of axis-parallel hyperplanes (one at each internal node), hence inducing categories which are represented by (axis-parallel) hyperrectangles in such a space.

A natural extension of that hypothesis space is obtained by associating to each internal node hyperplanes in general position, hence partitioning the input space by means of polygonal (polyhedral) categories.

This approach has been pursued by many researchers, often with different motivations, and hence the resulting hypothesis space has been given a number of different names: multivariate DTs [6], oblique DTs [8], DTs using linear combinations of the attributes [5], Linear Machine DTs, Neural Decision Trees [12], Perceptron Trees [13], etc.

We will call them Perceptron Decision Trees (PDTs), as they can be regarded as binary trees having a simple perceptron associated to each decision node.
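A minimal sketch of such a tree and its evaluation follows; the Node class and the toy two-node tree are our own illustration, not a construction from the paper.

```python
import numpy as np

# A minimal PDT: each internal node holds a weight vector w over the augmented
# input (so w[-1] acts as the threshold); leaves hold class labels 0/1.
class Node:
    def __init__(self, w=None, left=None, right=None, label=None):
        self.w, self.left, self.right, self.label = w, left, right, label

def evaluate(tree, x):
    """Follow the unique root-to-leaf path: take the right child iff the
    node's perceptron fires on the (augmented) input."""
    node = tree
    x_aug = np.append(x, 1.0)          # constant coordinate -> threshold
    while node.label is None:
        node = node.right if x_aug @ node.w > 0 else node.left
    return node.label

# Toy 2-node PDT over R^2 computing: class 1 iff x0 > 0 and x1 > 0.
leaf0, leaf1 = Node(label=0), Node(label=1)
root = Node(w=np.array([1.0, 0.0, 0.0]),
            left=leaf0,
            right=Node(w=np.array([0.0, 1.0, 0.0]), left=leaf0, right=leaf1))
assert evaluate(root, np.array([2.0, 3.0])) == 1
assert evaluate(root, np.array([2.0, -3.0])) == 0
```

Each root-to-leaf path thus carves out a polyhedral region, in line with the "hyperplanes in general position" description above.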
Different algorithms for top-down induction of PDTs from data have been proposed, based on different principles [10], [5], [8]. Experimental study of learning by means of PDTs indicates that their performances are sometimes better than those of traditional decision trees in terms of generalization error, and usually much better in terms of tree-size [8], [6], but on some data sets PDTs can be outperformed by normal DTs.

We investigate an alternative strategy for improving the generalization of these structures, namely placing maximal margin hyperplanes at the decision nodes. By use of a novel analysis we are able to demonstrate that improved generalization bounds can be obtained for this approach. Experiments confirm that such a method delivers more accurate trees in all tested databases.

2 Generalized Decision Trees

Definition 2.1 Generalized Decision Trees (GDT).
Given a space X and a set of boolean functions F = {f : X -> {0,1}}, the class GDT(F) of Generalized Decision Trees over F are functions which can be implemented using a binary tree where each internal node is labeled with an element of F, and each leaf is labeled with either 1 or 0.
To evaluate a particular tree T on input x in X, all the boolean functions associated to the nodes are assigned the same argument x in X, which is the argument of T(x). The values assumed by them determine a unique path from the root to a leaf: at each internal node the left (respectively right) edge to a child is taken if the output of the function associated to that internal node is 0 (respectively 1). The value of the function at the assignment of an x in X is the value associated to the leaf reached. We say that input x reaches a node of the tree if that node is on the evaluation path for x.

In the following, the nodes are the internal nodes of the binary tree, and the leaves are its external ones.

Examples.
- Given X = {0,1}^n, a Boolean Decision Tree (BDT) is a GDT over

  F_BDT = {f : f(x) = x_i, for all x in X}.

- Given X = R^n, a C4.5-like Decision Tree (CDT) is a GDT over

  F_CDT = {f_theta : f_theta(x) = 1 iff x_i > theta}.

  Decision trees of this kind, defined on a continuous space, are the output of common algorithms like C4.5 and CART, and we will call them - for short - CDTs.

- Given X = R^n, a Perceptron Decision Tree (PDT) is a GDT over

  F_PDT = {w^T x : w in R^{n+1}},

  where we have assumed that the inputs have been augmented with a coordinate of constant value, hence implementing a thresholded perceptron.

3 Data-dependent SRM

We begin with the definition of the fat-shattering dimension, which was first introduced in [7], and has been used for several problems in learning since [1, 4, 2, 3].

Definition 3.1 Let F be a set of real valued functions. We say that a set of points X is gamma-shattered by F relative to r = (r_x)_{x in X} if there are real numbers r_x indexed by x in X such that for all binary vectors b indexed by X, there is a function f_b in F satisfying

  f_b(x) >= r_x + gamma if b_x = 1, and f_b(x) <= r_x - gamma otherwise.

The fat-shattering dimension fat_F of the set F is a function from the positive real numbers to the integers which maps a value gamma to the size of the largest gamma-shattered set, if this is finite, or infinity otherwise.

As an example which will be relevant to the subsequent analysis consider the class:

  F_lin = {x -> <w, x> + theta : ||w|| = 1}.

We quote the following result from [11].

Corollary 3.2 [11] Let F_lin be restricted to points in a ball of n dimensions of radius R about the origin and with thresholds |theta| <= R. Then

  fat_{F_lin}(gamma) <= min{9 R^2 / gamma^2, n + 1} + 1.

The following theorem bounds the generalisation of a classifier in terms of the fat-shattering dimension rather than the usual Vapnik-Chervonenkis or Pseudo dimension.
Let T_theta denote the threshold function at theta: T_theta : R -> {0,1}, T_theta(a) = 1 iff a > theta. For a class of functions F, T_theta(F) = {T_theta(f) : f in F}.

Theorem 3.3 [11] Consider a real valued function class F having fat-shattering function bounded above by the function afat : R -> N which is continuous from the right. Fix theta in R. If a learner correctly classifies m independently generated examples with h = T_theta(f) in T_theta(F) such that er(h) = 0 and gamma = min_i |f(x_i) - theta|, then with confidence 1 - delta the expected error of h is bounded from above by

  e(m, k, delta) = (2/m) (k log(8em/k) log(32m) + log(8m/delta)),

where k = afat(gamma/8).

The importance of this theorem is that it can be used to explain how a classifier can give better generalisation than would be predicted by a classical analysis of its VC dimension. Essentially, expanding the margin performs an automatic capacity control for function classes with small fat-shattering dimensions. The theorem shows that when a large margin is achieved it is as if we were working in a lower VC class. We should stress that in general the bounds obtained should be better for cases where a large margin is observed, but that a priori there is no guarantee that such a margin will occur. Therefore a priori only the classical VC bound can be used. In view of corresponding lower bounds on the generalisation error in terms of the VC dimension, the a posteriori bounds depend on a favourable probability distribution making the actual learning task easier. Hence, the result will only be useful if the distribution is favourable, or at least not adversarial. In this sense the result is a distribution dependent result, despite not being distribution dependent in the traditional sense that assumptions about the distribution have had to be made in its derivation. The benign behaviour of the distribution is automatically estimated in the learning process.
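Plugging Corollary 3.2 into Theorem 3.3 gives a concrete number. A sketch, assuming base-2 logarithms (the logarithm base is not legible in this copy) and illustrative values of m, R, n and delta:

```python
import math

def fat_bound(gamma, R, n):
    """Corollary 3.2: fat-shattering of unit-norm hyperplanes on a ball of
    radius R: fat(gamma) <= min(9 R^2 / gamma^2, n + 1) + 1."""
    return min(9 * R**2 / gamma**2, n + 1) + 1

def generalisation_bound(m, k, delta):
    """Theorem 3.3 error bound e(m, k, delta); base-2 logs assumed here."""
    lg = math.log2
    return (2.0 / m) * (k * lg(8 * math.e * m / k) * lg(32 * m) + lg(8 * m / delta))

# Larger observed margins give smaller k = fat(gamma/8), hence a tighter bound.
m, R, n, delta = 10**6, 1.0, 10**6, 0.01
loose = generalisation_bound(m, fat_bound(0.05 / 8, R, n), delta)
tight = generalisation_bound(m, fat_bound(0.5 / 8, R, n), delta)
assert tight < loose
```

This is exactly the "automatic capacity control" discussed above: the margin observed a posteriori selects the effective k.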
In order to perform a similar analysis for perceptron decision trees we will consider the set of margins obtained at each of the nodes, bounding the generalization as a function of these values.

4 Generalisation analysis of the Tree Class

It turns out that bounding the fat-shattering dimension of PDTs viewed as real function classifiers is difficult. We will therefore do a direct generalization analysis mimicking the proof of Theorem 3.3 but taking into account the margins at each of the decision nodes in the tree.

Definition 4.1 Let (X, d) be a (pseudo-) metric space, let A be a subset of X and epsilon > 0. A set B subset of X is an epsilon-cover for A if, for every a in A, there exists b in B such that d(a, b) < epsilon. The epsilon-covering number of A, N_d(epsilon, A), is the minimal cardinality of an epsilon-cover for A (if there is no such finite cover then it is defined to be infinity).

We write N(epsilon, F, x) for the epsilon-covering number of F with respect to the l_infinity pseudo-metric measuring the maximum discrepancy on the sample x. These numbers are bounded in the following Lemma.

Lemma 4.2 (Alon et al. [1]) Let F be a class of functions X -> [0,1] and P a distribution over X. Choose 0 < epsilon < 1 and let d = fat_F(epsilon/4). Then

  E(N(epsilon, F, x)) <= 2 (4m/epsilon^2)^{d log(2em/(d epsilon))},

where the expectation E is taken w.r.t. a sample x in X^m drawn according to P^m.

Corollary 4.3 [11] Let F be a class of functions X -> [a, b] and P a distribution over X. Choose 0 < epsilon < 1 and let d = fat_F(epsilon/4). Then

  E(N(epsilon, F, x)) <= 2 (4m(b - a)^2/epsilon^2)^{d log(2em(b - a)/(d epsilon))},

where the expectation E is over samples x in X^m drawn according to P^m.

We are now in a position to tackle the main lemma, which bounds the probability over a double sample that the first half has zero error and the second has error greater than an appropriate epsilon. Here, error is interpreted as being differently classified at the output of the tree.
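Because the covering-number bound above is astronomically large, it is most convenient to evaluate it in log form; a sketch, with natural logs assumed:

```python
import math

def log_covering_bound(m, epsilon, d, a=0.0, b=1.0):
    """Natural log of the Corollary 4.3 bound on E[N(epsilon, F, x)] for
    functions into [a, b], where d = fat_F(epsilon/4)."""
    base = 4 * m * (b - a)**2 / epsilon**2
    exponent = d * math.log(2 * math.e * m * (b - a) / (d * epsilon))
    return math.log(2) + exponent * math.log(base)

# The bound grows with the fat-shattering dimension d at fixed epsilon and m,
# which is why small margins (large d) give weak guarantees.
assert log_covering_bound(1000, 0.1, 5) < log_covering_bound(1000, 0.1, 10)
```

The product of such per-node cover sizes is what appears in the proof of the main lemma below.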
In order to simplify the notation in the following lemma we assume that the decision tree has $K$ nodes. We also denote $\mathrm{fat}_{\mathcal{F}^{\mathrm{lin}}}(\gamma)$ by $\mathrm{fat}(\gamma)$ to simplify the notation.

Lemma 4.4 Let $T$ be a perceptron decision tree with $K$ decision nodes with margins $\gamma_1, \gamma_2, \ldots, \gamma_K$ at the decision nodes. If it has correctly classified $m$ labelled examples generated independently according to the unknown (but fixed) distribution $P$, then we can bound the following probability to be less than $\delta$:

$$P^{2m}\left\{\mathbf{xy} : \exists \text{ a tree } T : T \text{ correctly classifies } \mathbf{x},\ \text{fraction of } \mathbf{y} \text{ misclassified} > \epsilon(m, K, \delta)\right\} < \delta,$$

where $\epsilon(m, K, \delta) = \frac{1}{m}\left(D \log(4m) + \log\frac{2^K}{\delta}\right)$, with $D = \sum_{i=1}^{K} k_i \log(4em/k_i)$ and $k_i = \mathrm{fat}(\gamma_i/8)$.

340    J. Shawe-Taylor and N. Cristianini

Proof: Using the standard permutation argument, we may fix a sequence $\mathbf{xy}$ and bound the probability under the uniform distribution on swapping permutations that the sequence satisfies the condition stated. We consider generating minimal $\gamma_k/2$-covers $B_k^{\mathbf{xy}}$ for each value of $k$, where $\gamma_k = \min\{\gamma' : \mathrm{fat}(\gamma'/8) \le k\}$. Suppose that for node $i$ of the tree the margin $\gamma_i$ of the hyperplane $w_i$ satisfies $\mathrm{fat}(\gamma_i/8) = k_i$. We can therefore find $f_i \in B_{k_i}^{\mathbf{xy}}$ whose output values are within $\gamma_{k_i}/2$ of $w_i$. We now consider the tree $T'$ obtained by replacing the node perceptrons $w_i$ of $T$ with the corresponding $f_i$. This tree performs the same classification function on the first half of the sample, and the margin remains larger than $\gamma_i - \gamma_{k_i}/2 > \gamma_{k_i}/2$. If a point in the second half of the sample is incorrectly classified by $T$, it will either still be incorrectly classified by the adapted tree $T'$, or will at one of the decision nodes $i$ in $T'$ be closer to the decision boundary than $\gamma_{k_i}/2$. The point is thus distinguishable from left-hand-side points, which are both correctly classified and have margin greater than $\gamma_{k_i}/2$ at node $i$.
Hence, that point must be kept on the right-hand side in order for the condition to be satisfied. Hence, the fraction of permutations that can be allowed for one choice of the functions from the covers is $2^{-\epsilon m}$. We must take the union bound over all choices of the functions from the covers. Using the techniques of [11], the number of these choices is bounded via Corollary 4.3 as follows:

$$\prod_{i=1}^{K} 2\,(8m)^{k_i \log(4em/k_i)} = 2^K (8m)^D,$$

where $D = \sum_{i=1}^{K} k_i \log(4em/k_i)$. The value of $\epsilon$ in the lemma statement therefore ensures that this union bound is less than $\delta$.

Using the standard lemma due to Vapnik [14, page 168] to bound the error probabilities in terms of the discrepancy on a double sample, combined with Lemma 4.4, gives the following result.

Theorem 4.5 Suppose we are able to classify an $m$ sample of labelled examples using a perceptron decision tree with $K$ nodes, obtaining margins $\gamma_i$ at node $i$. Then we can bound the generalisation error with probability greater than $1 - \delta$ to be less than

$$\frac{1}{m}\left(D \log(4m) + \log\frac{(8m)^K \binom{2K}{K}}{(K+1)\delta}\right),$$

where $D = \sum_{i=1}^{K} k_i \log(4em/k_i)$ and $k_i = \mathrm{fat}(\gamma_i/8)$.

Proof: We must bound the probabilities over different architectures of trees and different margins. We simply have to choose the values of $\epsilon$ to ensure that the individual $\delta$'s are sufficiently small that the total over all possible choices is less than $\delta$. The details are omitted in this abstract.

5 Experiments

The theoretical results obtained in the previous section imply that an algorithm which produces large margin splits should have better generalization, since increasing the margins at the internal nodes has the effect of decreasing the bound on the test error.
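The qualitative content of Theorem 4.5 can be checked numerically: larger node margins mean smaller $k_i = \mathrm{fat}(\gamma_i/8)$, which tightens the bound. A minimal sketch, assuming base-2 logarithms and reading the tree-counting term as $(8m)^K \binom{2K}{K} / ((K+1)\delta)$ (both assumptions of this reconstruction; the sample size and $k_i$ values are hypothetical):

```python
import math

def pdt_bound(m, ks, delta):
    """Theorem 4.5 bound for a perceptron decision tree that correctly
    classifies m examples, where ks[i] = fat(gamma_i / 8) for the margin
    gamma_i achieved at decision node i, and K = len(ks)."""
    K = len(ks)
    D = sum(k * math.log2(4 * math.e * m / k) for k in ks)
    tree_term = (8 * m) ** K * math.comb(2 * K, K) / ((K + 1) * delta)
    return (D * math.log2(4 * m) + math.log2(tree_term)) / m

# Smaller ks (i.e. larger margins at the three nodes) shrink the bound:
print(pdt_bound(m=5000, ks=[40, 40, 40], delta=0.05))
print(pdt_bound(m=5000, ks=[8, 8, 8], delta=0.05))
```

Note that the number of nodes $K$ enters only through the additive $\log$ term, while the margins act multiplicatively through $D$, which is why the margin-maximisation experiment below keeps the tree structure fixed and optimises only the node hyperplanes.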
In order to test this strategy, we have performed the following experiment, divided in two parts: first run a standard perceptron decision tree algorithm, and then for each decision node generate a maximal margin hyperplane implementing the same dichotomy in place of the decision boundary generated by the algorithm.

Input: Random $m$ sample $\mathbf{x}$ with corresponding classification $\mathbf{b}$.
Algorithm: Find a perceptron decision tree $T$ which correctly classifies the sample using a standard algorithm;
    Let $K$ = number of decision nodes of $T$;
    From tree $T$ create $T'$ by executing the following loop:
        For each decision node $i$, replace the weight vector $w_i$ by the vector $w_i'$ which realises the maximal margin hyperplane agreeing with $w_i$ on the set of inputs reaching node $i$;
        Let the margin of $w_i'$ on the inputs reaching node $i$ be $\gamma_i$.
Output: Classifier $T'$, with bound on the generalisation error in terms of the number of decision nodes $K$ and $D = \sum_{i=1}^{K} k_i \log(4em/k_i)$, where $k_i = \mathrm{fat}(\gamma_i/8)$.

Note that the classification of $T$ and $T'$ agree on the sample and hence that $T'$ is consistent with the sample.

As a PDT learning algorithm we have used OC1 [8], created by Murthy, Kasif and Salzberg and freely available over the internet. It is a randomized algorithm, which performs simulated annealing for learning the perceptrons. The details about the randomization, the pruning, and the splitting criteria can be found in [8]. The data we have used for the test are 4 of the 5 sets used in the original OC1 paper, which are publicly available in the UCI data repository [16].
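The quantity the replacement step maximises is the usual geometric margin of a node's hyperplane on the inputs reaching that node. A minimal sketch of the margin computation (the 1-D node subsample and the two split positions are hypothetical; the maximal-margin solver and OC1 itself are not reimplemented here):

```python
def margin(w, b, X, y):
    """Geometric margin of the hyperplane w.x + b = 0 on labelled
    points (X, y): min_i y_i * (w . x_i + b) / ||w||.  It is positive
    iff every point is on its correct side, i.e. the hyperplane
    realises the same dichotomy on the inputs reaching the node."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return min(yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, yi in zip(X, y))

# Hypothetical subsample reaching one node, perfectly split by the tree:
X = [(0.5,), (1.0,), (3.0,), (4.0,)]
y = [-1, -1, 1, 1]

print(margin((1.0,), -1.2, X, y))  # tree's arbitrary split at x = 1.2
print(margin((1.0,), -2.0, X, y))  # maximal-margin split at x = 2.0
```

Both hyperplanes agree with the tree's dichotomy on the subsample, but the second achieves margin 1.0 rather than 0.2, which is exactly the substitution performed at each node of $T'$.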
The results we have obtained on these data are compatible with the ones reported in the original OC1 paper, the differences being due to: different divisions between training and testing sets and their sizes; the absence in our experiments of cross-validation and other techniques to estimate the predictive accuracy of the PDT; and the inherently randomized nature of the algorithm.

The second stage of the experiment involved finding, for each node, the hyperplane which performs the same split as performed by the OC1 tree but with the maximal margin. This can be done by considering the subsample reaching each node as perfectly divided in two parts, and feeding the data, accordingly relabelled, to an algorithm which finds the optimal split in the linearly separable case. The maximal margin hyperplanes are then placed in the decision nodes and the new tree is tested on the same testing set.

The data sets we have used are: Wisconsin Breast Cancer, Pima Indians Diabetes, Boston Housing (transformed into a classification problem by thresholding the price at $21,000) and the classical Iris data studied by Fisher (more information about the databases and their authors is in [8]). All the details about sample sizes, number of attributes and results (training and testing accuracy, tree size) are summarised in Table 1.

We were not particularly interested in achieving a high testing accuracy, but rather in observing whether improved performance can be obtained by increasing the margin. For this reason we did not try to optimize the performance of the original classifier by using cross-validation, or a convenient training/testing set ratio. The relevant quantity, in this experiment, is the difference in the testing error between a PDT with arbitrary margins and the same tree with optimized margins.
This quantity has turned out to be always positive, ranging from 1.7 to 2.8 percent of gain, on test errors which were already very low.

Table 1:

Dataset   train   OC1 test   FAT test   #trs   #ts   attrib.   classes   nodes
CANC      96.53   93.52      95.37      249    108   9         2         1
IRIS      96.67   96.67      98.33      90     60    4         3         2
DIAB      89.00   70.48      72.45      209    559   8         2         4
HOUS      95.90   81.43      84.29      306    140   13        2         7

References

[1] Noga Alon, Shai Ben-David, Nicolo Cesa-Bianchi and David Haussler, "Scale-sensitive Dimensions, Uniform Convergence, and Learnability," in Proceedings of the Conference on Foundations of Computer Science (FOCS), (1993). Also to appear in Journal of the ACM.

[2] Martin Anthony and Peter Bartlett, "Function learning from interpolation", Technical Report, (1994). (An extended abstract appeared in Computational Learning Theory, Proceedings 2nd European Conference, EuroCOLT'95, pages 211-221, ed. Paul Vitanyi, (Lecture Notes in Artificial Intelligence, 904) Springer-Verlag, Berlin, 1995.)

[3] Peter L. Bartlett and Philip M. Long, "Prediction, Learning, Uniform Convergence, and Scale-Sensitive Dimensions," Preprint, Department of Systems Engineering, Australian National University, November 1995.

[4] Peter L. Bartlett, Philip M. Long, and Robert C. Williamson, "Fat-shattering and the Learnability of Real-valued Functions," Journal of Computer and System Sciences, 52(3), 434-452, (1996).

[5] Breiman L., Friedman J.H., Olshen R.A., Stone C.J., "Classification and Regression Trees", Wadsworth International Group, Belmont, CA, 1984.

[6] Brodley C.E., Utgoff P.E., Multivariate Decision Trees, Machine Learning 19, pp. 45-77, 1995.

[7] Michael J. Kearns and Robert E. Schapire, "Efficient Distribution-free Learning of Probabilistic Concepts," pages 382-391 in Proceedings
of the 31st Symposium on the Foundations of Computer Science, IEEE Computer Society Press, Los Alamitos, CA, 1990.

[8] Murthy S.K., Kasif S., Salzberg S., A System for Induction of Oblique Decision Trees, Journal of Artificial Intelligence Research, 2 (1994), pp. 1-32.

[9] Quinlan J.R., "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.

[10] Sankar A., Mammone R.J., Growing and Pruning Neural Tree Networks, IEEE Transactions on Computers, 42:291-299, 1993.

[11] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, Martin Anthony, Structural Risk Minimization over Data-Dependent Hierarchies, NeuroCOLT Technical Report NC-TR-96-053, 1996. (ftp://ftp.dcs.rhbnc.ac.uk/pub/neurocolt/tech_reports)

[12] J.A. Sirat and J.-P. Nadal, "Neural trees: a new tool for classification", Network, 1, pp. 423-438, 1990.

[13] Utgoff P.E., Perceptron Trees: a Case Study in Hybrid Concept Representations, Connection Science 1 (1989), pp. 377-391.

[14] Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.

[15] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.

[16] University of California, Irvine, Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html