{"title": "Learning Stochastic Perceptrons Under k-Blocking Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 279, "page_last": 286, "abstract": null, "full_text": "Learning Stochastic Perceptrons Under \n\nk-Blocking Distributions \n\nMario Marchand \n\nSaeed Hadjifaradji \n\nOttawa-Carleton Institute for Physics \n\nOttawa-Carleton Institute for Physics \n\nUniversity of Ottawa \n\nOttawa, Ont., Canada KIN 6N5 \n\nmario@physics.uottawa.ca \n\nUniversity of Ottawa \n\nOttawa, Ont., Canada KIN 6N5 \n\nsaeed@physics.uottawa.ca \n\nAbstract \n\nWe present a statistical method that PAC learns the class of \nstochastic perceptrons with arbitrary monotonic activation func(cid:173)\ntion and weights Wi E {-I, 0, + I} when the probability distribution \nthat generates the input examples is member of a family that we \ncall k-blocking distributions. Such distributions represent an impor(cid:173)\ntant step beyond the case where each input variable is statistically \nindependent since the 2k-blocking family contains all the Markov \ndistributions of order k. By stochastic percept ron we mean a per(cid:173)\nceptron which, upon presentation of input vector x, outputs 1 with \nprobability fCLJi WiXi - B). Because the same algorithm works for \nany monotonic (nondecreasing or nonincreasing) activation func(cid:173)\ntion f on Boolean domain, it handles the well studied cases of \nsigmolds and the \"usual\" radial basis functions. \n\n1 \n\nINTRODUCTION \n\nWithin recent years, the field of computational learning theory has emerged to pro(cid:173)\nvide a rigorous framework for the design and analysis of learning algorithms. A \ncentral notion in this framework, known as the \"Probably Approximatively Cor(cid:173)\nrect\" (PAC) learning criterion (Valiant, 1984), has recently been extended (Hassler, \n1992) to analyze the learn ability of probabilistic concepts (Kearns and Schapire, \n1994; Schapire, 1992). 
Such concepts, which are stochastic rules that give the probability that input example x is classified as being positive, are natural probabilistic extensions of the deterministic concepts originally studied by Valiant (1984). \n\nMotivated by the stochastic nature of many \"real-world\" learning problems and by the indisputable fact that biological neurons are probabilistic devices, some preliminary studies about the PAC learnability of simple probabilistic neural concepts have been reported recently (Golea and Marchand, 1993; Golea and Marchand, 1994). However, the probabilistic behaviors considered in these studies are quite specific and clearly need to be extended. Indeed, only classification noise superimposed on a deterministic signum function was considered in Golea and Marchand (1993). The probabilistic network analyzed in Golea and Marchand (1994) consists of a linear superposition of signum functions and is thus solvable as a (simple) case of linear regression. What is clearly needed is the extension to the non-linear cases of sigmoids and radial basis functions. Another criticism of Golea and Marchand (1993, 1994) is the fact that their learnability results were established only for distributions where each input variable is statistically independent from all the others (sometimes called product distributions). In fact, very few positive learning results for non-trivial p-concept classes are known to hold for larger classes of distributions. Therefore, in an effort to find algorithms that will work in practice, we introduce in this paper a new family of distributions that we call k-blocking. 
As we will argue, this family has the dual advantage of avoiding malicious and unnatural distributions that are prone to render simple concept classes unlearnable (Lin and Vitter, 1991) and of being likely to contain several distributions found in practice. Our main contribution is to present a simple statistical method that PAC learns (in polynomial time) the class of stochastic perceptrons with monotonic (but otherwise arbitrary) activation functions and weights w_i ∈ {-1, 0, +1} when the input examples are generated according to any distribution that is a member of the k-blocking family. Due to space constraints, only a sketch of the proofs is presented here. \n\n2 DEFINITIONS \n\nThe instance (input) space, I^n, is the Boolean domain {-1, +1}^n. The set of all input variables is denoted by X. Each input example x is generated according to some unknown distribution D on I^n. We will often use P_D(x), or simply p(x), to denote the probability of observing the vector value x under distribution D. If U and V are two disjoint subsets of X, x_U and x_V will denote the restriction (or projection) of x over the variables of U and V respectively, and P_D(x_U | x_V) will denote the probability, under distribution D, of observing the vector value x_U (for the variables in U) given that the variables in V are set to the vector value x_V. \n\nFollowing Kearns and Schapire (1994), a probabilistic concept (p-concept) is a map c : I^n → [0, 1] for which c(x) represents the probability that example x is classified as positive. More precisely, upon presentation of input x, an output of σ = 1 is generated (by an unknown target p-concept) with probability c(x) and an output of σ = 0 is generated with probability 1 - c(x). 
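The example-generation process just described can be sketched in code, with the stochastic perceptron of the abstract as the target p-concept. This is a minimal illustration only: the sigmoid activation, the threshold value, and the particular weight and input vectors below are assumptions made for the example, not values taken from the paper.

```python
import math
import random

def sigmoid(t):
    # one possible nondecreasing activation function f
    return 1.0 / (1.0 + math.exp(-t))

def stochastic_perceptron_output(x, w, theta, f=sigmoid, rng=random):
    """Sample the label sigma in {0, 1}: sigma = 1 with probability f(sum_i w_i x_i - theta)."""
    p = f(sum(wi * xi for wi, xi in zip(w, x)) - theta)
    return 1 if rng.random() < p else 0

# Illustrative (made-up) target: weights in {-1, 0, +1}, inputs in {-1, +1}^n.
w = [+1, -1, 0, +1]
x = [+1, +1, -1, -1]
print(stochastic_perceptron_output(x, w, theta=0.5))  # 0 or 1, at random
```

Repeated calls on the same x estimate c(x) empirically, which is exactly the kind of statistic the learning algorithm of Section 4 relies on.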
\nA stochastic perceptron is a p-concept parameterized by a vector of n weights w_i and an activation function f(·) such that the probability that input example x is classified as positive is given by \n\nc(x) = f(Σ_{i=1}^n w_i x_i - θ)   (1) \n\nWe consider the case of a non-linear function f(·) since the linear case can be solved by a standard least square approximation like the one performed by Kearns and Schapire (1994) for linear sums of basis functions. We restrict ourselves to the case where f(·) is monotonic, i.e. either nondecreasing or nonincreasing. But since any nonincreasing f(·) combined with a weight vector w can always be represented by a nondecreasing f(·) combined with a weight vector -w, we can assume without loss of generality that the target stochastic perceptron has a nondecreasing f(·). Hence, we allow any sigmoid-type of activation function (with arbitrary threshold). Also, since our instance space I^n lies on an n-sphere, eq. 1 also includes any nonincreasing radial basis function of the type φ(z²) where z = |x - w| and w is interpreted as the \"center\" of φ. The only significant restriction is on the weights, where we allow only w_i ∈ {-1, 0, +1}. \n\nAs usual, the goal of the learner is to return a hypothesis h which is a good approximation of the target p-concept c. But, in contrast with decision rule learning, which attempts to \"filter out\" the noisy behavior by returning a deterministic hypothesis, the learner will attempt the harder (and more useful) task of modeling the target p-concept by returning a p-concept hypothesis. As a measure of error between the target and the hypothesis p-concepts we adopt the variation distance d_v(·,·) defined as: \n\nerr(h, c) = d_v(h, c) def= Σ_x P_D(x) |h(x) - c(x)|   (2) \n\nwhere the summation is over all the 2^n possible values of x. Hence, the same D is 
Hence, the same D is \nused for both training and testing. The following formulation of the PAC criterion \n(Valiant, 1984; Hassler, 1992) will be sufficient for our purpose. \n\nDefinition 1 Algorithm A is said to PAC learn the class C of p-concepts by using \nthe hypothesis class H (of p-concepts) under a family V of distributions on instance \nspace In, iff for any c E C, any D E V, any 0 < t,8 < 1, algorithm A returns in a \ntime polynomial in (l/t, 1/8, n), an hypothesis h E H such that with probability at \nleast 1 - 8, err(h, c) < t. \n\n3 K-BLOCKING DISTRIBUTIONS \n\nTo learn the class of stochastic perceptrons, the algorithm will try to discover each \nweight Wi that connects to input variable Xi by estimating how the probability \nof observing a positive output (0\" = 1) is affected by \"hard-wiring\" variable Xi to \nsome fixed value. This should clearly give some information about Wi when Xi \nis statistically independent from all the other variables as was the case for Golea \nand Marchand (1993) and Schapire (1992). However, if the input variables are \ncorrelated, then the process of fixing variable Xi will carry over neighboring variables \nwhich in turn will affect other variables until all the variables are perturbed (even \nin the simplest case of a first order Markov chain). The information about Wi will \n\n\f282 \n\nMario Marchand, Saeed HadjiJaradji \n\nthen be smeared by all the other weights. Therefore, to obtain information only \non Wi, we need to break this \"chain reaction\" by fixing some other variables. The \nnotion of blocking sets serves this purpose. \nLoosely speaking, a set of variables is said to be a blocking set1 for variable Xi \nif the distribution on all the remaining variables is unaffected by the setting of Xi \nwhenever all the variables of the blocking set are set to a fixed value. More precisely, \nwe have: \n\nDefinition 2 Let B be a subset of X and let U = X - (B U {Xi}). 
Let x_B and x_U be the restrictions of x on B and U respectively, and let b be an assignment for x_B. Then B is said to be a blocking set for variable x_i (with respect to D), iff: \n\nP_D(x_U | x_B = b, x_i = +1) = P_D(x_U | x_B = b, x_i = -1)   for all b and x_U \n\nIn addition, if B is no longer a blocking set when we remove any one of its variables, we then say that B is a minimal blocking set for variable x_i. \n\nWe thus adopt the following definition for the k-blocking family. \n\nDefinition 3 Distribution D on I^n is said to be k-blocking iff |B_i| ≤ k for i = 1, 2, ..., n, when each B_i is a minimal blocking set for variable x_i. \n\nThe k-blocking family is quite a large class of distributions. In fact we have the following property: \n\nProperty 1 All Markov distributions of kth order are members of the 2k-blocking family. \n\nProof: By kth order Markov distributions, we mean distributions which can be exactly written as a Chow(k) expansion (see Hoeffgen, 1993) for some permutation of the variables. We prove it here (by using standard techniques such as in Abend et al., 1965) for first order Markov distributions; the generalization for k > 1 is straightforward. Recall that for Markov chain distributions we have: p(x_j | x_{j-1}, ..., x_1) = p(x_j | x_{j-1}) for 1 < j ≤ n. Hence: \n\np(x_1 ... x_{j-2}, x_{j+2} ... x_n | x_{j-1}, x_j, x_{j+1}) \n= p(x_1) p(x_2 | x_1) ... p(x_j | x_{j-1}) p(x_{j+1} | x_j) ... p(x_n | x_{n-1}) / p(x_{j-1}, x_j, x_{j+1}) \n= p(x_1) p(x_2 | x_1) ... p(x_{j-1} | x_{j-2}) p(x_{j+2} | x_{j+1}) ... p(x_n | x_{n-1}) / p(x_{j-1}) \n= p(x_1 ... x_{j-2}, x_{j+2} ... x_n | x_{j-1}, x̄_j, x_{j+1}) \n\nwhere x̄_j denotes the negation of x_j. Thus, we see that Markov chain distributions are a special case of 2-blocking distributions: the blocking set of each variable consisting only of the two first-neighbor variables. □ 
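Property 1 can also be checked numerically for a small chain. The sketch below builds a first-order Markov distribution on {-1, +1}^5 (the initial and transition probabilities, and the helper names `joint` and `cond_rest`, are made up for illustration) and verifies that, once the two neighbors of x_j are fixed, the conditional distribution of the remaining variables does not depend on the setting of x_j:

```python
import itertools

n, j = 5, 2  # five +-1 variables; test the blocking set of the middle variable x_j
p1 = {+1: 0.6, -1: 0.4}                      # illustrative initial law p(x_1)
trans = {(+1, +1): 0.7, (+1, -1): 0.3,       # illustrative transitions p(next | prev)
         (-1, +1): 0.2, (-1, -1): 0.8}

def joint(x):
    # first-order Markov factorization p(x_1) * prod_j p(x_{j+1} | x_j)
    p = p1[x[0]]
    for a, b in zip(x, x[1:]):
        p *= trans[(a, b)]
    return p

def cond_rest(b_prev, b_next, xj):
    """P(all other variables | x_{j-1} = b_prev, x_{j+1} = b_next, x_j = xj)."""
    rest_idx = [i for i in range(n) if i not in (j - 1, j, j + 1)]
    weights, total = {}, 0.0
    for x in itertools.product((-1, +1), repeat=n):
        if (x[j - 1], x[j], x[j + 1]) != (b_prev, xj, b_next):
            continue
        key = tuple(x[i] for i in rest_idx)
        weights[key] = weights.get(key, 0.0) + joint(x)
        total += joint(x)
    return {k: v / total for k, v in weights.items()}

# {x_{j-1}, x_{j+1}} blocks x_j: conditioning on x_j = +1 or -1 makes no difference.
for bp, bn in itertools.product((-1, +1), repeat=2):
    d_plus, d_minus = cond_rest(bp, bn, +1), cond_rest(bp, bn, -1)
    assert all(abs(d_plus[k] - d_minus[k]) < 1e-12 for k in d_plus)
print("2-blocking property verified")  # prints "2-blocking property verified"
```

The same brute-force check works for any order-k chain by fixing the k neighbors on each side, matching the 2k-blocking claim of Property 1.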
\nThe proposed algorithm for learning stochastic perceptrons needs to be provided with a blocking set (of at most k variables) for each input variable. Hoeffgen (1993) has recently proven that Chow(1) and Chow(k > 1) expansions are efficiently learnable, the latter under some restricted conditions. We can thus use these algorithms to discover the blocking sets for such distributions. However, the efficient learnability of unrestricted Chow(k > 1) expansions and larger classes of distributions, such as the k-blocking family, is still unknown. In fact, from the hardness results of Hoeffgen (1993), we can see that it is definitely very hard (perhaps NP-complete) to find the blocking sets if the learner has no information available other than the fact that the distribution is k-blocking. On the other hand, we can argue that the \"natural\" ordering of the variables present in many \"real-world\" situations is such that the blocking set of any given variable is among the neighboring variables. In vision for example, we expect that the setting of a pixel will directly affect only those located in its neighborhood, the other pixels being affected only through this neighborhood. In such cases, the neighborhood of a variable \"naturally\" provides its blocking set. \n\n¹The wording \"blocking set\" was also used by Hancock & Mansour (Proc. of COLT'91, 179-183, Morgan Kaufmann Publ.) to denote a property of the target concept. In contrast, our definition of blocking set denotes a property of the input distribution only. \n\n4 LEARNING STOCHASTIC PERCEPTRONS \n\nWe first establish (the intuitive fact) that, without making much error, we can always consider that the target p-concept is defined only over the variables which are not almost always set to the same value. \n\nLemma 1 Let V be a set of v variables x_i for which Pr(x_i = a_i) > 1 - α. 
Let c be a p-concept and let c' be the same p-concept as c except that the reading of each variable x_i ∈ V is replaced by the reading of the constant value a_i. Then err(c', c) < v · α. \n\nProof: Let a be the vector obtained from the concatenation of all the a_i's and let x_V be the vector obtained from x by keeping only the components x_i which are in V. Then err(c', c) ≤ Pr(x_V ≠ a) ≤ Σ_{i∈V} Pr(x_i ≠ a_i). □ \n\nFor a given set of blocking sets {B_i}_{i=1}^n, the algorithm will try to discover each weight w_i by estimating the blocked influence of x_i defined as: \n\nBinf(x_i | b_i) def= Pr(σ = 1 | x_{B_i} = b_i, x_i = +1) - Pr(σ = 1 | x_{B_i} = b_i, x_i = -1) \n\nwhere x_{B_i} denotes the restriction of x on the blocking set B_i for variable x_i and b_i is an assignment for x_{B_i}. The following lemma assures the learner that Binf(x_i | b_i) contains enough information about w_i. \n\nLemma 2 Let the target p-concept be a stochastic perceptron on I^n having a nondecreasing activation function and weights taken from {-1, 0, +1}. Then, for any assignment b_i for the variables in the blocking set B_i of variable x_i, we have: \n\nBinf(x_i | b_i) { ≥ 0 if w_i = +1;  = 0 if w_i = 0;  ≤ 0 if w_i = -1 }   (3) \n\nProof sketch: Let U = X - (B_i ∪ {x_i}), s = Σ_{j∈U} w_j x_j and ζ = Σ_{k∈B_i} w_k b_k. Let p(s) denote the probability of observing s (under D). Then Binf(x_i | b_i) = Σ_s p(s) [f(s + ζ + w_i) - f(s + ζ - w_i)], from which we find the desired result for a nondecreasing f(·). □ \n\nIn principle, lemma 2 enables the learner to discover w_i from Binf(x_i | b_i). The learner, however, only has access to its empirical estimate B̂inf(x_i | b_i) from a finite sample. Hence, we will use Hoeffding's inequality (Hoeffding, 1963) to find the number of examples needed for a probability p to be close to its empirical estimate p̂ with high probability. \n\nLemma 3 (Hoeffding, 1963) Let Y_1, ... 
, Y_m be a sequence of m independent Bernoulli trials, each succeeding with probability p. Let p̂ = Σ_{i=1}^m Y_i / m. Then: \n\nPr(|p̂ - p| > ε) ≤ 2 exp(-2mε²) \n\nHence, by writing Binf(x_i | b_i) in terms of (unconditional) probabilities that can be estimated from all the training examples, we find from lemma 3 that the number m_0(ε, δ, n) of examples needed to have |B̂inf(x_i | b_i) - Binf(x_i | b_i)| < ε with probability at least 1 - δ is given by: \n\nm_0(ε, δ, n) = (2/(κε))² ln(4/δ) \n\nwhere κ = α^{k+1} is the lowest permissible value for P_D(b_i, x_i) (see lemma 1). So, if the minimal nonzero value for |Binf(x_i | b_i)| is β, then the number of examples needed to find, with confidence at least 1 - δ, the exact value for w_i among {-1, 0, +1} is such that we need to have: Pr(|B̂inf(x_i | b_i) - Binf(x_i | b_i)| < β/2) > 1 - δ. Thus, whenever β is of O(e^{-n}), we will need O(e^{2n}) examples to find (with prob. > 1 - δ) the value for w_i. So, in order to be able to PAC learn from a polynomial sample, we must arrange things so that we do not need to worry about such low values for |Binf(x_i | b_i)|. We therefore consider the maximum blocked influence defined as: \n\nBinf(x_i) def= Binf(x_i | b_i*) \n\nwhere b_i* is the vector value for which |Binf(x_i | b_i)| is the largest. We now show that the learner can ignore all variables x_i for which |Binf(x_i)| is too small (without making much error). \n\nLemma 4 Let c be a stochastic perceptron with nondecreasing activation function f(·) and weights taken from {-1, 0, +1}. Let V ⊂ X and let c_V be the same stochastic perceptron as c except that w_i = 0 for all x_i ∈ V and its activation function is changed to f(· + θ_V). Then, there always exists a value for θ_V such that: \n\nerr(c_V, c) ≤ Σ_{i∈V} |Binf(x_i)| \n\nProof sketch: By induction on |V|. 
To first verify the lemma for V = {x_1}, let b be a vector of values for the setting of all x_i ∈ B_1 and let x_U be a vector of values for the setting of all x_j ∈ U = X - (B_1 ∪ {x_1}). Let s = Σ_{j∈U} w_j x_j and ζ = Σ_{j∈B_1} w_j x_j; then for θ_V = w_1, we have: \n\nerr(c_V, c) = Σ_{x_U} Σ_b P_D(x_U | b) P_D(b | x_1 = -1) P_D(x_1 = -1) × |f(s + ζ + w_1) - f(s + ζ - w_1)| ≤ |Binf(x_1)| \n\nWe now assume that the lemma holds for V = {x_1, x_2, ..., x_k} and prove it for W = V ∪ {x_{k+1}}. Let S = {x_{k+1}} and let f(· + θ_W), f(· + θ_V) and f(· + θ_S) denote respectively the activation functions for c_W, c_V and c_S. By inspecting the expressions for err(c_V, c) and err(c_W, c_S), we can see that there always exist a value for θ_W ∈ {θ_V + w_{k+1}, θ_V - w_{k+1}} and θ_S ∈ {w_{k+1}, -w_{k+1}} such that err(c_W, c_S) ≤ err(c_V, c). And since d_v(·,·) satisfies the triangle inequality, err(c_W, c) ≤ err(c_V, c) + |Binf(x_{k+1})|. □ \n\nAfter discovering the weights, the hypothesis p-concept h returned by the learner will simply be the table look-up of the estimated probabilities of observing a positive classification given that Σ_{i=1}^n w_i x_i = s, for all s values that are observed with sufficient probability (the hypothesis can output any value for the values of s that are observed very rarely). We thus have the following learning algorithm for stochastic perceptrons. \n\nAlgorithm LearnSP(n, ε, δ, {B_i}_{i=1}^n) \n\n1. Call m = 128 (2n/ε)^{2k+4} ln(16n/δ) training examples (where k = max_i |B_i|). \n\n2. Compute P̂r(x_i = +1) for each variable x_i. Neglect x_i whenever we have P̂r(x_i = +1) < ε/(4n) or P̂r(x_i = +1) > 1 - ε/(4n). \n\n3. For each variable x_i and for each of its blocking vector values b_i, compute B̂inf(x_i | b_i). Let b_i* be the value of b_i for which |B̂inf(x_i | b_i)| is the largest. Let B̂inf(x_i) = B̂inf(x_i | b_i*). \n\n4. For each variable x_i: \n\n(a) Let w_i = +1 whenever B̂inf(x_i) > ε/(4n). 
\n(b) Let w_i = -1 whenever B̂inf(x_i) < -ε/(4n). \n(c) Otherwise let w_i = 0. \n\n5. Compute P̂r(Σ_{i=1}^n w_i x_i = s) for s = -n, ..., +n. \n\n6. Return the hypothesis p-concept h formed by the table look-up: \n\nh(x) = h'(s) = P̂r(σ = 1 | Σ_{i=1}^n w_i x_i = s) \n\nfor all s for which P̂r(Σ_{i=1}^n w_i x_i = s) > ε/(8n + 8). For the other s values, let h'(s) = 0 (or any other value). \n\nTheorem 1 Algorithm LearnSP PAC learns the class of stochastic perceptrons on I^n with monotonic activation functions and weights w_i ∈ {-1, 0, +1} under any k-blocking distribution (when a blocking set for each variable is known). The number of examples required is m = 128 (2n/ε)^{2k+4} ln(16n/δ) (and the time needed is O(n × m)) for the returned hypothesis to make error at most ε with confidence at least 1 - δ. \n\nProof sketch: From Hoeffding's inequality (lemma 3) we can show that this sample size is sufficient to ensure that: \n\n• |P̂r(x_i = +1) - Pr(x_i = +1)| < ε/(4n) with confidence at least 1 - δ/(4n) \n• |B̂inf(x_i) - Binf(x_i)| < ε/(4n) with confidence at least 1 - δ/(4n) \n• |P̂r(Σ_{i=1}^n w_i x_i = s) - Pr(Σ_{i=1}^n w_i x_i = s)| < ε²/[64(n + 1)] with confidence at least 1 - δ/(4n + 4) \n• |P̂r(σ = 1 | Σ_{i=1}^n w_i x_i = s) - Pr(σ = 1 | Σ_{i=1}^n w_i x_i = s)| < ε/4 with confidence at least 1 - δ/4 \n\nFrom this and from lemmas 1, 2 and 4, it follows that the returned hypothesis will make error at most ε with confidence at least 1 - δ. □ \n\nAcknowledgments \n\nWe thank Mostefa Golea, Klaus-U. Hoeffgen and Stefan Poelt for useful comments and discussions about technical points. M. Marchand is supported by NSERC grant OGP0122405. Saeed Hadjifaradji is supported by the MCHE of Iran. \n\nReferences \n\nAbend K., Hartley T.J. & Kanal L.N. (1965) \"Classification of Binary Random Patterns\", IEEE Trans. Inform. Theory, vol. IT-11, 538-544. \nGolea, M. & Marchand, M. 
(1993) \"On Learning Perceptrons with Binary Weights\", \nNeural Computation vol. 5, 765-782. \nGolea, M. & Marchand M. (1994) \"On Learning Simple Deterministic and Prob(cid:173)\nabilistic Neural Concepts\", in Shawe-Talor J. , Anthony M. (eds.), Computational \nLearning Theory: EuroCOLT'93, Oxford University Press, pp. 47-60. \nHaussler D. (1992) \"Decision Theoritic Generalizations of the PAC Model for Neural \nNet and Other Learning Applications\", Information and Computation vol. 100,78-\n150. \n\nHoeffgen K.U. (1993) \"On Learning and Robust Learning of Product Distributions\", \nProceedings of the 6th ACM Conference on Computational Learning Theory, ACM \nPress, 77-83. \nHoeffding W. (1963) \"Probability inequalities for sums of bounded random vari(cid:173)\nabIes\", Journal of the American Statistical Association, vol. 58(301), 13-30. \nKearns M.J. and Schapire R.E. (1994) \"Efficient Distribution-free Learning ofProb(cid:173)\nabilistic Concepts\", Journal of Computer and System Sciences, Vol. 48, pp. 464-497. \nLin J.H. & Vitter J.S. (1991) \"Complexity Results on Learning by Neural Nets\", \nMachine Learning, Vol. 6, 211-230. \nSchapire R.E. (1992) The Design and Analysis of Efficient Learning Algorithms, \nCambridge MA: MIT Press. \nValiant L.G. (1984) \"A Theory of the Learnable\", Comm. ACM, Vol. 27, 1134-\n1142. \n\n\f", "award": [], "sourceid": 878, "authors": [{"given_name": "Mario", "family_name": "Marchand", "institution": null}, {"given_name": "Saeed", "family_name": "Hadjifaradji", "institution": null}]}