{"title": "Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 921, "page_last": 928, "abstract": null, "full_text": "Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
Philip M. Long, Google, Mountain View, CA, plong@google.com
Rocco A. Servedio, Department of Computer Science, Columbia University, New York, NY, rocco@cs.columbia.edu

Abstract
We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n Boolean variables using poly(k, log n) many examples, provided that the marginal distribution over the relevant variables is "not too concentrated" in an L_2-norm sense. Using a recent result of Hastad, we extend the analysis to obtain a similar (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.

1 Introduction
A decision list is a Boolean function defined over n Boolean inputs of the following form: "if ell_1 then b_1, else if ell_2 then b_2, ..., else if ell_k then b_k, else b_{k+1}". Here ell_1, ..., ell_k are literals defined over the n Boolean variables and b_1, ..., b_{k+1} are Boolean values. Since the work of Rivest [24], decision lists have been widely studied in learning theory and machine learning. A question that has received much attention is whether it is possible to attribute-efficiently learn decision lists, i.e. to learn decision lists of length k over n variables using only poly(k, log n) many examples. 
This question was first asked by Blum in 1990 [3] and has since been re-posed numerous times [4, 5, 6, 29]; as we now briefly describe, a range of partial results have been obtained along different lines. Several authors [4, 29] have noted that Littlestone's Winnow algorithm [17] can learn decision lists of length k using 2^{O(k)} log n examples in time 2^{O(k)} n log n. Valiant [29] and Nevo and El-Yaniv [21] sharpened the analysis of Winnow in the special case where the decision list has only a bounded number of alternations in the sequence of output bits b_1, ..., b_{k+1}. It is well known that the "halving algorithm" (see [1, 2, 19]) can learn length-k decision lists using only O(k log n) examples, but the running time of the algorithm is n^k. Klivans and Servedio [16] used polynomial threshold functions together with Winnow to obtain a tradeoff between running time and the number of examples required, by giving an algorithm that runs in time n^{O(k^{1/3})} and uses 2^{O~(k^{1/3})} log n examples. In this work we take a different approach by relaxing the requirement that the algorithm work under any distribution on examples or in the mistake-bound model. This relaxation in fact allows us to handle not just decision lists, but arbitrary linear threshold functions with k nonzero coefficients.

The approach and results. We will analyze a smooth boosting algorithm (see Section 2) together with a weak learner that exhaustively considers all 2n possible literals x_i, -x_i as weak hypotheses. The algorithm, which we call Algorithm A, is described in more detail in Section 6. The algorithm's performance can be bounded in terms of the L_2-norm of the distribution over examples. Recall that the L_2-norm of a distribution D over a finite set X is ||D||_2 := (sum_{x in X} D(x)^2)^{1/2}. The L_2-norm can be used to evaluate the "spread" of a probability distribution: if the probability mass is concentrated on a constant number of elements of the domain then the L_2-norm is constant, whereas if the probability mass is spread uniformly over a domain of size N then the L_2-norm is 1/sqrt(N).

Our main results are as follows. Let D be a distribution over {-1,1}^n. Suppose the target function f has k relevant variables. Let D^rel denote the marginal distribution over {-1,1}^k induced by the variables relevant to f (i.e. if the relevant variables are x_{i_1}, ..., x_{i_k}, then the value that D^rel puts on an input (z_1, ..., z_k) is Pr_{x ~ D}[x_{i_1} ... x_{i_k} = z_1 ... z_k]). Let U_k be the uniform distribution over {-1,1}^k and suppose that ||D^rel||_2 / ||U_k||_2 = alpha. (Note that for any D we have alpha >= 1, since U_k has minimal L_2-norm among all distributions over {-1,1}^k.) Then we have:

Theorem 1 Suppose the target function is an arbitrary decision list in the setting described above. Then given poly(log n, 1/epsilon, alpha, log(1/delta)) examples, Algorithm A runs in poly(n, alpha, 1/epsilon, log(1/delta)) time and with probability 1 - delta constructs a hypothesis h that is epsilon-accurate with respect to D.

Theorem 2 Suppose the target function is an arbitrary linear threshold function in the setting described above. Then given poly(k, log n, 2^{O~((alpha/epsilon)^2)}, log(1/delta)) examples, Algorithm A runs in poly(n, 2^{O~((alpha/epsilon)^2)}, log(1/delta)) time and with probability 1 - delta constructs a hypothesis h that is epsilon-accurate with respect to D.

Relation to previous work. Jackson and Craven [14] considered a similar approach of using Boolean literals as weak hypotheses for a boosting algorithm (in their case, AdaBoost). Jackson and Craven proved that for any distribution over examples, the resulting algorithm requires poly(K, log n) examples to learn any weight-K linear threshold function, i.e.
any function of the form sgn(sum_{i=1}^n w_i x_i - theta) over Boolean variables where all weights w_i are integers and sum_{i=1}^n |w_i| <= K (this clearly implies that there are at most K relevant variables). It is well known [12, 18] that general decision lists of length k can only be expressed by linear threshold functions of weight 2^{Theta(k)}, and thus the result of [14] does not give an attribute-efficient learning algorithm for decision lists. More recently Servedio [27] considered essentially the same algorithm we analyze in this work by specifically studying smooth boosting algorithms with the "best-single-variable" weak learner. He considered a general linear threshold learning problem (with no assumption that there are few relevant variables) and showed that if the distribution satisfies a margin condition then the algorithm has some level of resilience to malicious noise. The analysis of this paper is different from that of [27]; to the best of our knowledge ours is the first analysis in which the smoothness property of boosting is exploited for attribute-efficient learning.

(Recall that a linear threshold function f : {-1,1}^n -> {-1,1} is a function f(x) = sgn(sum_{i=1}^n w_i x_i - theta) where w_i, theta are real numbers and the sgn function outputs the numerical sign of its argument.)

2 Boosting and Smooth Boosting
Fix a target function f : {-1,1}^n -> {-1,1} and a distribution D over {-1,1}^n. A hypothesis function h : {-1,1}^n -> {-1,1} is a gamma-weak hypothesis for f with respect to D if E_D[f h] >= gamma. We sometimes refer to E_D[f h] as the advantage of h with respect to f. We remind the reader that a boosting algorithm is an algorithm which operates in a sequence of stages and at each stage t maintains a distribution D_t over {-1,1}^n. At stage t the boosting algorithm is given a weak hypothesis h_t for f with respect to D_t; the boosting algorithm then uses this to construct the next distribution D_{t+1} over {-1,1}^n.
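To make the advantage notion concrete, the following small sketch (our own illustration; the target function, pool, and names are not from the paper) computes E_D[f h] for every literal and constant hypothesis, the same pool of weak hypotheses Algorithm A searches over:

```python
from itertools import product

def advantage(h, f, dist):
    """E_D[f*h]: correlation of hypothesis h with target f, where dist maps
    inputs (tuples of +/-1 values) to probabilities."""
    return sum(p * f(x) * h(x) for x, p in dist.items())

# Illustrative target: the length-2 decision list
# "if x0 = 1 output 1, else if x1 = 1 output -1, else output 1".
def f(x):
    if x[0] == 1:
        return 1
    if x[1] == 1:
        return -1
    return 1

n = 3
uniform = {x: 1.0 / 2 ** n for x in product((-1, 1), repeat=n)}

# The weak-hypothesis pool: all 2n literals plus the two constants.
pool = [lambda x, i=i, s=s: s * x[i] for i in range(n) for s in (1, -1)]
pool += [lambda x: 1, lambda x: -1]
best = max(advantage(h, f, uniform) for h in pool)
```

Under the uniform distribution the literal x0 (and also -x1 and the constant 1) achieves advantage 0.5 against this target, so a single literal is already a nontrivial weak hypothesis here.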
After T such stages the boosting algorithm constructs a final hypothesis h based on the weak hypotheses h_1, ..., h_T that is guaranteed to have high accuracy with respect to the initial distribution D. See [25] for more details.

Let D_1, D_2 be two distributions. For kappa >= 1 we say that D_1 is kappa-smooth with respect to D_2 if for all x in {-1,1}^n we have D_1(x)/D_2(x) <= kappa. Following [15], we say that a boosting algorithm B is (kappa, epsilon)-smooth if for any initial distribution D, any distribution D_t that is generated starting from D when B is used to boost to epsilon-accuracy with weak hypotheses at each stage is kappa-smooth w.r.t. D. It is known that there are boosting algorithms that are kappa-smooth for kappa = O(1/epsilon), with no dependence on the advantage gamma of the weak hypotheses; see e.g. [8]. For the rest of the paper B will denote such a smooth boosting algorithm.

It is easy to see that every distribution D which is (1/epsilon)-smooth w.r.t. the uniform distribution U satisfies ||D||_2/||U||_2 <= sqrt(1/epsilon). On the other hand, there are distributions D that are highly non-smooth relative to U but which still have ||D||_2/||U||_2 small. For instance, the distribution D over {-1,1}^k which puts weight 1/2^{k/2} on a single point and distributes the remaining weight uniformly on the other 2^k - 1 points is only 2^{k/2}-smooth (i.e. very non-smooth) but satisfies ||D||_2/||U_k||_2 = Theta(1). Thus the L_2-norm condition we consider in this paper is a weaker condition than smoothness with respect to the uniform distribution.

3 Total variation distance and L_2-norm of distributions
The total variation distance between two probability distributions D_1, D_2 over a finite set X is

  d_TV(D_1, D_2) := max_{S subseteq X} (D_1(S) - D_2(S)) = (1/2) sum_{x in X} |D_1(x) - D_2(x)|.

It is easy to see that the total variation distance between any two distributions is at most 1, and equals 1 if and only if the supports of the distributions are disjoint. The following is immediate:

Lemma 1 For any two distributions D_1 and D_2 over a finite domain X, we have d_TV(D_1, D_2) = 1 - sum_{x in X} min{D_1(x), D_2(x)}.
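The "spiky" distribution discussed in Section 2 can be checked numerically, along with Lemma 1's identity; a small sketch (our own code, variable names illustrative):

```python
import math

k = 10
N = 2 ** k

# The spiky distribution: weight 1/2^(k/2) on one point, the remaining
# mass spread uniformly over the other 2^k - 1 points.
spike = 2 ** (-k / 2)
rest = (1.0 - spike) / (N - 1)
D = [spike] + [rest] * (N - 1)
U = [1.0 / N] * N

# Smoothness of D w.r.t. U: max_x D(x)/U(x).  Here it is 2^(k/2) = 32,
# so D is very non-smooth relative to uniform.
kappa = max(d / u for d, u in zip(D, U))

# L2-norm ratio alpha = ||D||_2 / ||U||_2, which stays Theta(1).
alpha = math.sqrt(sum(d * d for d in D)) / math.sqrt(sum(u * u for u in U))

# Total variation distance, computed two ways: directly, and via the
# Lemma 1 identity d_TV = 1 - sum_x min(D(x), U(x)).
tv = 0.5 * sum(abs(d - u) for d, u in zip(D, U))
tv_via_lemma1 = 1.0 - sum(min(d, u) for d, u in zip(D, U))
```

For k = 10 this gives kappa = 32 while alpha is below 2, illustrating why the L_2-norm condition is strictly weaker than a smoothness condition.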
We can bound the total variation distance between a distribution D and the uniform distribution in terms of the ratio ||D||_2/||U||_2 of the L_2-norms as follows:

Lemma 2 For any distribution D over a finite domain X, if U is the uniform distribution over X, we have d_TV(D, U) <= 1 - ||U||_2^2 / (4 ||D||_2^2).

Proof: Let M = ||D||_2/||U||_2. Since ||D||_2^2 = E_{x ~ D}[D(x)], we have E_{x ~ D}[D(x)] = M^2 ||U||_2^2 = M^2/|X|. By Markov's inequality,

  Pr_{x ~ D}[D(x) >= 2 M^2 U(x)] = Pr_{x ~ D}[D(x) >= 2 M^2/|X|] <= 1/2.   (1)

By Lemma 1, we have

  1 - d_TV(D, U) = sum_x min{D(x), U(x)} >= sum_{x : D(x) <= 2 M^2 U(x)} min{D(x), U(x)} >= sum_{x : D(x) <= 2 M^2 U(x)} D(x)/(2 M^2) >= 1/(4 M^2),

where the second inequality uses the fact that M >= 1 (so that D(x)/(2 M^2) <= D(x)) and the third inequality uses (1). Using the definition of M and solving for d_TV(D, U) completes the proof.

4 Weak hypotheses for decision lists
Let f be any decision list that depends on k variables:

  if ell_1 then output b_1, else if ell_2 then output b_2, ..., else if ell_k then output b_k, else output b_{k+1}   (2)

where each ell_i is either "(x_i = 1)" or "(x_i = -1)". The following folklore lemma can be proved by an easy induction (see e.g. [12, 26] for proofs of essentially equivalent claims):

Lemma 3 The decision list f can be represented by a linear threshold function of the form f(x) = sgn(c_1 x_1 + ... + c_k x_k - theta) where each c_i = +/-2^{k-i} and theta is an even integer in the range [-2^k, 2^k].

It is easy to see that for any fixed c_1, ..., c_k as in the lemma, as x = (x_1, ..., x_k) varies over {-1,1}^k the linear form c_1 x_1 + ... + c_k x_k assumes each odd integer value in the range [-2^k, 2^k] exactly once. Now we can prove:

Lemma 4 Let f be any decision list of length k over the n Boolean variables x_1, ..., x_n. Let D be any distribution over {-1,1}^n, and let D^rel denote the marginal distribution over {-1,1}^k induced by the k relevant variables of f. Suppose that d_TV(D^rel, U_k) <= 1 - gamma. Then there is some weak hypothesis h in {x_1, -x_1, ...
, x_n, -x_n, 1, -1} which satisfies E_{D^rel}[f h] >= gamma^2/16.

Proof: We first observe that by Lemma 3 and the well-known "discriminator lemma" of [23, 11], under any distribution D some weak hypothesis h from {x_1, -x_1, ..., x_n, -x_n, 1, -1} must have E_D[f h] >= 1/2^{k+1}. This immediately establishes the lemma for all gamma <= 4/2^{(k+1)/2}, and thus we may suppose w.l.o.g. that gamma > 4/2^{(k+1)/2}.

We may assume w.l.o.g. that f is the decision list (2), that is, that the first literal concerns x_1, the second concerns x_2, and so on. Let L(x) denote the linear form c_1 x_1 + ... + c_k x_k - theta from Lemma 3, so f(x) = sgn(L(x)). If x is drawn uniformly from {-1,1}^k, then L(x) is distributed uniformly over the 2^k odd integers in the interval [-2^k - theta, 2^k - theta], since c_1 x_1 is uniform over {+/-2^{k-1}}, c_2 x_2 over {+/-2^{k-2}}, and so on.

Let S denote the set of those x in {-1,1}^k that satisfy |L(x)| <= gamma 2^k/4. There are at most gamma 2^k/4 + 1 elements in S, corresponding to L(x) = +/-1, +/-3, ..., +/-(2j - 1), where j is the greatest integer such that 2j - 1 <= gamma 2^k/4. Since gamma > 4/2^{(k+1)/2}, certainly |S| <= 1 + gamma 2^k/4 <= gamma 2^k/2. We thus have Pr_{U_k}[|L(x)| > gamma 2^k/4] >= 1 - gamma/2. It follows that Pr_{D^rel}[|L(x)| > gamma 2^k/4] >= gamma/2 (for otherwise we would have d_TV(D^rel, U_k) > 1 - gamma), and consequently we have E_{D^rel}[|L(x)|] >= gamma^2 2^k/8.

Now we follow the simple argument used to prove the "discriminator lemma" [23, 11]. We have

  E_{D^rel}[|L(x)|] = E_{D^rel}[f(x) L(x)] = c_1 E[f(x) x_1] + ... + c_k E[f(x) x_k] - theta E[f(x)] >= gamma^2 2^k/8.   (3)

Recalling that each |c_i| = 2^{k-i}, it follows that some h in {x_1, -x_1, ..., x_n, -x_n, 1, -1} must satisfy E_{D^rel}[f h] >= (gamma^2 2^k/8)/(2^{k-1} + ... + 2^0 + |theta|). Since |theta| <= 2^k this is at least gamma^2/16, and the proof is complete.

5 Weak hypotheses for linear threshold functions
Now we consider the more general setting of arbitrary linear threshold functions. Though there are additional technical complications, the basic idea is as in the previous section.
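The representation behind Lemma 3 can be checked exhaustively for small k. The following sketch uses one standard construction of the weights and threshold (our own code; the lemma itself only asserts existence, and the function names are illustrative):

```python
from itertools import product

def eval_list(lits, bits, default, x):
    """Evaluate decision list (2): lits[i] = (variable index, required sign);
    output bits[i] for the first satisfied literal, else the default bit."""
    for (i, s), b in zip(lits, bits):
        if x[i] == s:
            return b
    return default

def as_threshold(lits, bits, default, k):
    """One standard construction for Lemma 3: weight +/-2^(k-i) on the i-th
    tested variable and an even threshold theta in [-2^k, 2^k]."""
    w = [0] * k
    theta = -default
    for pos, ((i, s), b) in enumerate(zip(lits, bits)):
        w[i] = b * s * 2 ** (k - pos - 1)
        theta -= b * 2 ** (k - pos - 1)
    return w, theta

k = 4
lits = [(0, 1), (1, -1), (2, 1), (3, -1)]
bits = [1, -1, 1, -1]
default = 1
w, theta = as_threshold(lits, bits, default, k)
# Check sgn(sum_i w_i x_i - theta) against the list on all 2^k inputs.
agree = all(
    eval_list(lits, bits, default, x)
    == (1 if sum(wi * xi for wi, xi in zip(w, x)) - theta > 0 else -1)
    for x in product((-1, 1), repeat=k)
)
```

The exponentially decreasing weights make the first satisfied literal dominate all later terms, which is exactly the induction behind the folklore lemma.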
We will use the following fact due to Hastad:

Fact 3 (Hastad) (see [28], Theorem 9) Let f : {-1,1}^k -> {-1,1} be any linear threshold function that depends on all k variables x_1, ..., x_k. There is a representation sgn(sum_{i=1}^k w_i x_i - theta) for f which is such that (assuming the weights w_1, ..., w_k are ordered by decreasing magnitude, 1 = |w_1| >= |w_2| >= ... >= |w_k| > 0) we have |w_i| >= 1/(i!(k + 1)) for all i = 2, ..., k.

The main result of this section is the following lemma. The proof uses ideas from the proof of Theorem 2 in [28].

Lemma 5 Let f : {-1,1}^n -> {-1,1} be any linear threshold function that depends on k variables. Let D be any distribution over {-1,1}^n, and let D^rel denote the marginal distribution over {-1,1}^k induced by the k relevant variables of f. Suppose that d_TV(D^rel, U_k) <= 1 - gamma. Then there is some weak hypothesis h in {x_1, -x_1, ..., x_n, -x_n, 1, -1} which satisfies E_{D^rel}[f h] >= 1/(k^2 2^{O~(1/gamma^2)}).

Proof sketch: We may assume that f(x) = sgn(L(x)) where L(x) = w_1 x_1 + ... + w_k x_k - theta with w_1, ..., w_k as described in Fact 3.

Let ell := O~(1/gamma^2) = O((1/gamma^2) poly(log(1/gamma))). (We will specify ell in more detail later.) Suppose first that k <= ell. By a well-known result of Muroga et al. [20], every linear threshold function that depends on k variables can be represented using integer weights each of magnitude 2^{O(k log k)}. Now the discriminator lemma [11] implies that for any distribution P, for some h in {x_1, -x_1, ..., x_n, -x_n, 1, -1} we have E_P[f h] >= 1/2^{O(k log k)}. If k <= ell = O((1/gamma^2) poly(log(1/gamma))), we have k log k = O~(1/gamma^2). Thus, in this case, E_P[f h] >= 1/2^{O~(1/gamma^2)}, so the lemma holds if k <= ell. Thus we henceforth assume that ell < k.

It remains only to show that

  E_{D^rel}[|L(x)|] >= 1/(k 2^{O~(1/gamma^2)});   (4)

once we have this, then following (3) we get

  E_{D^rel}[|L(x)|] = E_{D^rel}[f L] = w_1 E[f(x) x_1] + ... + w_k E[f(x) x_k] - theta E[f(x)] >= 1/(k 2^{O~(1/gamma^2)}),

and now since each |w_i| <= 1 (and w.l.o.g. |theta| <= k) this implies that some h satisfies E_{D^rel}[f h] >= 1/(k^2 2^{O~(1/gamma^2)}) as desired.
Similar to [28] we consider two cases (which are slightly different from the cases in [28]).

Case I: For all 1 <= i <= ell we have w_i^2/(sum_{j=i}^k w_j^2) > gamma^2/576.

Let beta := (sum_{j=ell+1}^k w_j^2)^{1/2} (2 ln(8/gamma))^{1/2}. Recall the following version of Hoeffding's bound: for any 0 != w in R^k and any t > 0, we have Pr_{x in {-1,1}^k}[|w . x| >= t ||w||] <= 2 e^{-t^2/2} (where we write ||w|| to denote (sum_{i=1}^k w_i^2)^{1/2}). This bound directly gives us that

  Pr_{x ~ U_k}[|w_{ell+1} x_{ell+1} + ... + w_k x_k| >= beta] <= 2 e^{-2 ln(8/gamma)/2} = gamma/4.   (5)

Moreover, the argument in [28] that establishes equation (4) of [28] also yields

  Pr_{x ~ U_k}[|w_1 x_1 + ... + w_ell x_ell - theta| <= 2 beta] <= gamma/4   (6)

in our current setting. (The only change that needs to be made to the argument of [28] is adjusting various constant factors in the definition of ell.) Equations (5) and (6) together yield Pr_{x ~ U_k}[|w_1 x_1 + ... + w_k x_k - theta| >= beta] >= 1 - gamma/2. Now as before, taken together with the d_TV bound this yields Pr_{D^rel}[|L(x)| >= beta] >= gamma/2, and hence we have E_{D^rel}[|L(x)|] >= gamma beta/2. Since beta > w_{ell+1} and w_{ell+1} >= 1/((k + 1)(ell + 1)!) by Fact 3, we have established (4) in Case I.

Case II: For some value J <= ell we have w_J^2/(sum_{i=J}^k w_i^2) <= gamma^2/576.

Let us fix any setting z_1, ..., z_{J-1} in {-1,1}^{J-1} of the variables x_1, ..., x_{J-1}. By an inequality due to Petrov [22] (see [28], Theorem 4) we have

  Pr_{x_J, ..., x_k ~ U_{k-J+1}}[|w_1 z_1 + ... + w_{J-1} z_{J-1} + w_J x_J + ... + w_k x_k - theta| <= w_J] <= 6 w_J/(sum_{i=J}^k w_i^2)^{1/2} <= 6 gamma/24 = gamma/4.

Thus for each z in {-1,1}^{J-1} we have Pr_{x ~ U_k}[|L(x)| <= w_J | x_1 ... x_{J-1} = z_1 ... z_{J-1}] <= gamma/4. This immediately yields Pr_{x ~ U_k}[|L(x)| > w_J] >= 1 - gamma/4, which in turn gives Pr_{x ~ D^rel}[|L(x)| > w_J] >= 3 gamma/4 and hence E_{D^rel}[|L(x)|] >= 3 gamma w_J/4 by our usual arguments. Now (4) follows using Fact 3 and J <= ell.

6 Putting it all together
Algorithm A works by running an O(1/epsilon)-smooth boosting-by-filtering algorithm; for concreteness we use the MadaBoost algorithm of Domingo and Watanabe [8].
At the t-th stage of boosting, when MadaBoost simulates the distribution D_t, the weak learning algorithm works as follows: O((log n + log(1/delta'))/gamma^2) many examples are drawn from the simulated distribution D_t, and these examples are used to obtain an empirical estimate of E_{D_t}[f h] for each h in {x_1, -x_1, ..., x_n, -x_n, -1, 1}. (Here gamma is a lower bound on the advantage E_{D_t}[f h] of the best weak hypothesis available at each stage; we discuss this more below.) The weak hypothesis used at this stage is the one with the highest observed empirical estimate. The algorithm is run for T = O(1/(epsilon gamma^2)) stages of boosting.

Consider any fixed stage t of the algorithm's execution. As shown in [8], at most O(1/epsilon) draws from the original distribution D are required for MadaBoost to simulate a draw from the distribution D_t. (This is a direct consequence of the fact that MadaBoost is O(1/epsilon)-smooth; the distribution D_t is simulated using rejection sampling from D.) Standard tail bounds show that if the best hypothesis h has E[f h] >= gamma, then with probability 1 - delta' the hypothesis selected will have E[f h] >= gamma/2. In [8] it is shown that if MadaBoost always has a (gamma/2)-advantage weak hypothesis at each stage, then after at most T = O(1/(epsilon gamma^2)) stages the algorithm will construct a hypothesis which has error at most epsilon. Thus it suffices to take delta' = O(delta epsilon gamma^2). The overall number of examples used by Algorithm A is O((log n + log(1/delta))/(epsilon^2 gamma^4)). Thus to establish Theorems 1 and 2, it remains only to show that for any initial distribution D with ||D^rel||_2/||U_k||_2 = alpha, the distributions D_t that arise in the course of boosting are always such that the best weak hypothesis h in {x_1, -x_1, ..., x_n, -x_n, -1, 1} has sufficiently large advantage.

Suppose f is a target function that depends on some set of k (out of n) variables. Consider what happens if we run a (1/epsilon)-smooth boosting algorithm, where the initial distribution D satisfies ||D^rel||_2/||U_k||_2 = alpha.
At each stage we will have D_t^rel(x) <= (1/epsilon) D^rel(x) for all x in {-1,1}^k, and consequently we will have

  ||D_t^rel||_2^2 = sum_{x in {-1,1}^k} D_t^rel(x)^2 <= (1/epsilon^2) sum_{x in {-1,1}^k} D^rel(x)^2 = (alpha^2/epsilon^2) sum_{x in {-1,1}^k} U_k(x)^2.

Thus, by Lemma 2 each distribution D_t will satisfy d_TV(D_t^rel, U_k) <= 1 - epsilon^2/(4 alpha^2). Now Lemmas 4 and 5 imply that in both cases (decision lists and LTFs) the best weak hypothesis h does indeed have the required advantage.

7 Experiments
The smoothness property enabled the analysis of this paper. Is smoothness really helpful for learning decision lists with respect to diffuse distributions? Is it critical? This section is aimed at addressing these questions experimentally.

We compared the accuracy of the classifiers output by a number of smooth boosters from the literature with AdaBoost (which is known not to be a smooth booster in general; see e.g. Section 4.2 of [7]) on synthetic data in which the examples were distributed uniformly and the class designations were determined by applying a randomly generated decision list. The number of relevant variables was fixed at 10. The decision list was determined by picking ell_1, ..., ell_10 and b_1, ..., b_11 from (2) independently and uniformly at random from among the possibilities. We evaluated the following algorithms: (a) AdaBoost [9], (b) MadaBoost [8], (c) SmoothBoost [27], and (d) a smooth booster proposed by Gavinsky [10]. Due to space constraints, we cannot describe each of these in detail.1 Each booster was used to reweight the training data, and in each round the literal which minimized the weighted training error was chosen.
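The reweight-and-pick-best-literal loop used in the experiments can be sketched as follows. This is our own simplified, MadaBoost-flavoured sketch, not the exact algorithm of [8]: the only essential feature kept is that each example's weight is exponential in its negative margin but capped at 1, which is what keeps the reweighted distribution smooth with respect to the sample.

```python
import math

def advantage(h, X, y, w):
    """Weighted correlation E_w[f*h] under normalized weights w."""
    total = sum(w)
    return sum(wi * yi * h(xi) for wi, yi, xi in zip(w, y, X)) / total

def best_literal(X, y, w):
    """The exhaustive weak learner: every literal x_i, -x_i plus constants."""
    n = len(X[0])
    pool = [lambda x, i=i, s=s: s * x[i] for i in range(n) for s in (1, -1)]
    pool += [lambda x: 1, lambda x: -1]
    return max(pool, key=lambda h: advantage(h, X, y, w))

def smooth_boost(X, y, rounds=50):
    """MadaBoost-flavoured smooth boosting sketch (our simplification)."""
    m = len(X)
    margins = [0.0] * m
    ensemble = []  # list of (vote weight, weak hypothesis)
    for _ in range(rounds):
        # The smoothness-preserving cap: weight = min(1, exp(-margin)).
        w = [min(1.0, math.exp(-mg)) for mg in margins]
        h = best_literal(X, y, w)
        g = max(-0.99, min(0.99, advantage(h, X, y, w)))
        if abs(g) < 1e-6:
            break
        a = 0.5 * math.log((1 + g) / (1 - g))
        ensemble.append((a, h))
        for j in range(m):
            margins[j] += a * y[j] * h(X[j])
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1
```

Without the cap (i.e. with weights exp(-margin) as in AdaBoost), a few hard examples can come to dominate the reweighted distribution, which is exactly the non-smooth behaviour the experiments probe.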
Some of the algorithms choose the number of rounds of boosting as a function of the desired accuracy; instead, we ran all algorithms for 100 rounds.

1 Very roughly speaking, AdaBoost reweights the data to assign more weight to examples that previously chosen base classifiers have often classified incorrectly; it then outputs a weighted vote over the outputs of the base classifiers, where each voting weight is determined as a function of how well its base classifier performed. MadaBoost modifies AdaBoost to place a cap on the weight, prior to normalization. SmoothBoost [27] caps the weight more aggressively as learning progresses, but also reweights the data and weighs the base classifiers in a manner that does not depend on how well they performed. The manner in which Gavinsky's booster updates weights is significantly different from AdaBoost, and reminiscent of [13, 15].

m     n     Ada    Mada   Gavinsky  SB(0.05)  SB(0.1)  SB(0.2)  SB(0.4)
100   100   0.086  0.077  0.088     0.071     0.067    0.077    0.089
200   100   0.052  0.045  0.050     0.067     0.047    0.047    0.051
500   100   0.022  0.018  0.024     0.056     0.031    0.025    0.031
1000  100   0.016  0.014  0.024     0.063     0.036    0.028    0.033
100   1000  0.123  0.119  0.116     0.093     0.101    0.117    0.128
200   1000  0.079  0.072  0.083     0.071     0.064    0.072    0.081
500   1000  0.045  0.039  0.045     0.050     0.040    0.040    0.044
1000  1000  0.033  0.026  0.035     0.048     0.038    0.032    0.036

Table 1: Average test set error rate

m     n     Ada   Mada  Gavinsky  SB(0.05)  SB(0.1)  SB(0.2)  SB(0.4)
100   100   13.6  8.8   11.7      3.9       6.0      7.5      9.1
200   100   19.8  13.1  12.5      4.1       6.9      9.4      9.9
500   100   32.2  20.7  15.2      5.0       9.1      11.5     12.2
1000  100   37.2  19.2  15.3      7.1       10.7     12.1     13.0
100   1000  13.3  7.7   26.8      3.7       5.3      6.1      7.4
200   1000  19.8  11.5  19.4      4.4       7.4      9.5      11.7
500   1000  28.1  16.7  16.2      4.9       8.6      10.9     11.5
1000  1000  36.7  20.1  14.7      7.2       11.0     12.1     13.3

Table 2: Average smoothness
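The data-generation and train/test protocol just described can be sketched as follows (our own code; for simplicity the relevant variables are taken to be the first k, and all names are illustrative):

```python
import random

def random_decision_list(k, rng):
    """A random target as in the experiments: a random literal over each of
    the first k variables, and random output bits b_1, ..., b_{k+1}."""
    lits = [(i, rng.choice((-1, 1))) for i in range(k)]
    bits = [rng.choice((-1, 1)) for _ in range(k + 1)]
    def f(x):
        for (i, s), b in zip(lits, bits[:-1]):
            if x[i] == s:
                return b
        return bits[-1]
    return f

def make_dataset(m, n, k, rng):
    """m uniform examples over {-1,1}^n labelled by a random length-k list,
    split 2/3 train / 1/3 test as in the experimental protocol."""
    f = random_decision_list(k, rng)
    X = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(m)]
    y = [f(x) for x in X]
    cut = 2 * m // 3
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])

rng = random.Random(0)
(train_X, train_y), (test_X, test_y) = make_dataset(300, 100, 10, rng)
```

Repeating this 30000/m times, as in the paper's protocol, yields a combined test set of at least 10000 examples for each (m, n) cell of the tables.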
All boosters reweighted the data by normalizing some function that assigns weight to examples based on how well previously chosen base classifiers are doing at classifying them correctly. The booster proposed by Gavinsky might set all of these weights to zero; in such cases, it was terminated.

For each choice of the number of examples m and the number of features n, we repeated the following steps: (a) generate a random target, (b) generate m random examples, (c) split them into a training set with 2/3 of the examples and a test set with the remaining 1/3, (d) apply all the algorithms on the training set, and (e) apply all the resulting classifiers on the test set. We repeated the steps enough times so that the total size of the test sets was at least 10000; that is, we repeated them 30000/m times. The average test-set error is reported. SmoothBoost [27] has two parameters, gamma and theta. In his analysis, theta = gamma/(2 + gamma), so we used the same setting. We tried his algorithm with gamma set to each of 0.05, 0.1, 0.2 and 0.4.

The test set error rates are tabulated in Table 1. MadaBoost always improved on the accuracy of AdaBoost. The results are consistent with the possibility that AdaBoost learns decision lists attribute-efficiently with respect to the uniform distribution; this motivates theoretical study of whether this is true. One possible route is to prove that, for sources like this, AdaBoost is, with high probability, a smooth boosting algorithm. The average smoothnesses are given in Table 2. SmoothBoost [27] was seen to be fairly robust to the choice of gamma; with a good choice it sometimes performed the best. This motivates research into adaptive boosters along the lines of SmoothBoost.

References
[1] D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988.
[2] J. Barzdin and R. Freivald. On the prediction of general recursive functions. Soviet Mathematics Doklady, 13:1224-1228, 1972.
[3] A. Blum. Learning Boolean functions in an infinite attribute space.
In Proceedings of the Twenty-Second Annual Symposium on Theory of Computing, pages 64-72, 1990.
[4] A. Blum. On-line algorithms in machine learning. Available at http://www.cs.cmu.edu/~avrim/Papers/pubs.html, 1996.
[5] A. Blum, L. Hellerstein, and N. Littlestone. Learning in the presence of finitely or infinitely many irrelevant attributes. Journal of Computer and System Sciences, 50:32-40, 1995.
[6] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245-271, 1997.
[7] N. Bshouty and D. Gavinsky. On boosting with optimal poly-bounded distributions. Journal of Machine Learning Research, 3:483-506, 2002.
[8] C. Domingo and O. Watanabe. MadaBoost: a modified version of AdaBoost. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 180-189, 2000.
[9] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[10] D. Gavinsky. Optimally-smooth adaptive boosting and application to agnostic learning. Journal of Machine Learning Research, 4:101-117, 2003.
[11] A. Hajnal, W. Maass, P. Pudlak, M. Szegedy, and G. Turan. Threshold circuits of bounded depth. Journal of Computer and System Sciences, 46:129-154, 1993.
[12] S. Hampson and D. Volper. Linear function neurons: structure and training. Biological Cybernetics, 53:203-217, 1986.
[13] R. Impagliazzo. Hard-core distributions for somewhat hard problems. In Proceedings of the Thirty-Sixth Annual Symposium on Foundations of Computer Science, pages 538-545, 1995.
[14] J. Jackson and M. Craven. Learning sparse perceptrons. In NIPS 8, pages 654-660, 1996.
[15] A. Klivans and R. Servedio. Boosting and hard-core sets. Machine Learning, 53(3):217-238, 2003. Preliminary version in Proc. FOCS '99.
[16] A. Klivans and R. Servedio. Toward attribute efficient learning of decision lists and parities.
In Proceedings of the 17th Annual Conference on Learning Theory, pages 224-238, 2004.
[17] N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.
[18] M. Minsky and S. Papert. Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, MA, 1968.
[19] T. Mitchell. Generalization as search. Artificial Intelligence, 18:203-226, 1982.
[20] S. Muroga, I. Toda, and S. Takasu. Theory of majority switching elements. J. Franklin Institute, 271:376-418, 1961.
[21] Z. Nevo and R. El-Yaniv. On online learning of decision lists. Journal of Machine Learning Research, 3:271-301, 2002.
[22] V. V. Petrov. Limit theorems of probability theory. Oxford Science Publications, Oxford, England, 1995.
[23] G. Pisier. Remarques sur un resultat non publie de B. Maurey. Sem. d'Analyse Fonctionelle, 1(12):1980-81, 1981.
[24] R. Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987.
[25] R. Schapire. Theoretical views of boosting. In Proc. 10th ALT, pages 12-24, 1999.
[26] R. Servedio. On PAC learning using Winnow, Perceptron, and a Perceptron-like algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 296-307, 1999.
[27] R. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:633-648, 2003. Preliminary version in Proc. COLT '01.
[28] R. Servedio. Every linear threshold function has a low-weight approximator. In Proceedings of the 21st Conference on Computational Complexity (CCC), pages 18-30, 2006.
[29] L. Valiant. Projection learning. Machine Learning, 37(2):115-130, 1999.
", "award": [], "sourceid": 3007, "authors": [{"given_name": "Philip", "family_name": "Long", "institution": null}, {"given_name": "Rocco", "family_name": "Servedio", "institution": null}]}