{"title": "Human Rademacher Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 2322, "page_last": 2330, "abstract": "We propose to use Rademacher complexity, originally developed in computational learning theory, as a measure of human learning capacity. Rademacher complexity measures a learners ability to fit random data, and can be used to bound the learners true error based on the observed training sample error. We first review the definition of Rademacher complexity and its generalization bound. We then describe a learning the noise\" procedure to experimentally measure human Rademacher complexities. The results from empirical studies showed that: (i) human Rademacher complexity can be successfully measured, (ii) the complexity depends on the domain and training sample size in intuitive ways, (iii) human learning respects the generalization bounds, (iv) the bounds can be useful in predicting the danger of overfitting in human learning. Finally, we discuss the potential applications of human Rademacher complexity in cognitive science.\"", "full_text": "Human Rademacher Complexity\n\nXiaojin Zhu1, Timothy T. Rogers2, Bryan R. Gibson1\nDepartment of {1Computer Sciences, 2Psychology}\nUniversity of Wisconsin-Madison. Madison, WI 15213\n\njerryzhu@cs.wisc.edu, ttrogers@wisc.edu, bgibson@cs.wisc.edu\n\nAbstract\n\nWe propose to use Rademacher complexity, originally developed in computational\nlearning theory, as a measure of human learning capacity. Rademacher complex-\nity measures a learner\u2019s ability to \ufb01t random labels, and can be used to bound\nthe learner\u2019s true error based on the observed training sample error. We \ufb01rst re-\nview the de\ufb01nition of Rademacher complexity and its generalization bound. We\nthen describe a \u201clearning the noise\u201d procedure to experimentally measure human\nRademacher complexities. 
The results from empirical studies showed that: (i) human Rademacher complexity can be successfully measured, (ii) the complexity depends on the domain and training sample size in intuitive ways, (iii) human learning respects the generalization bounds, (iv) the bounds can be useful in predicting the danger of overfitting in human learning. Finally, we discuss the potential applications of human Rademacher complexity in cognitive science.

1 Introduction

Many problems in cognitive psychology arise from questions of capacity. How much information can human beings hold in mind and deploy in simple memory tasks [19, 15, 6]? What kinds of functions can humans easily acquire when learning to classify items [29, 7], and do they have biases for learning some functions over others [10]? Is there a single domain-general answer to these questions, or is the answer domain-specific [28]? How do human beings avoid over-fitting learning examples when acquiring knowledge that allows them to generalize [20]? Such questions are central to a variety of research in cognitive psychology, but only recently have they begun to be placed on a formal mathematical footing [7, 9, 5].

Machine learning offers a variety of formal approaches to measuring the capacity of a learning system, with concepts such as Vapnik-Chervonenkis (VC) dimension [27, 25, 12] and Rademacher complexity [1, 13, 24]. Based on these notions of capacity, one can quantify the generalization performance of a classifier, and the danger of over-fitting, by bounding its future test error using its observed training sample error. In this paper, we show how one such concept, Rademacher complexity, can be measured in humans, based on their performance in a "learning the noise" procedure. We chose Rademacher complexity (rather than the better-known VC dimension) because it is particularly amenable to experimental studies, as discussed in Section 5.
We assess whether human capacity varies depending on the nature of the materials to be categorized, and empirically test whether human generalization behavior respects the error bounds in a variety of categorization tasks. The results validate Rademacher complexity as a meaningful measure of human learning capacity, and provide a new perspective on the human tendency to overfit training data in category learning tasks. We note that our aim is not to develop a new formal approach to complexity, but rather to show how a well-studied formal measure can be computed for human beings.

2 Rademacher Complexity

Background and definitions. Let X be a domain of interest, which in psychology corresponds to a stimulus space. For example, X could be an infinite set of images parametrized by some continuous parameters, a finite set of words, etc. We will use x ∈ X to denote an instance (e.g., an image or a word) from the domain; precisely how x is represented is immaterial. We assume there is an underlying marginal distribution PX on X, such that x is sampled with probability PX(x) during training and testing. For example, PX can be uniform on the set of words.

Let f : X → R be a real-valued function. This corresponds to a hypothesis that predicts the label of any instance in the domain. The label can be a continuous value for regression, or {−1, 1} for binary classification. Let F be the set of such functions, or the hypothesis space, that we consider. For example, in machine learning F may be the set of linear classifiers. In the present work, we will take F to be the (possibly infinite) set of hypotheses from X to binary classes {−1, 1} that humans can come up with.

Rademacher complexity (see for example [1]) measures the capacity of the hypothesis space F by how easy it is for F to fit random noise. Consider a sample of n instances: x_1, ..., x_n drawn i.i.d. from PX.
Now generate n random numbers σ_1, ..., σ_n, each taking value −1 or 1 with equal probability. For a given function f ∈ F, its fit to the random numbers is defined as |∑_{i=1}^n σ_i f(x_i)|. This is easier to understand when f produces −1, 1 binary labels. In this case, the σ's can be thought of as random labels, and {(x_i, σ_i)}_{i=1}^n as a training sample. The fit measures how f's predictions match the random labels on the training sample: if f perfectly predicts the σ's, or completely the opposite by flipping the classes, then the fit is maximized at n; if f's predictions are orthogonal to the σ's, the fit is minimized at 0.

The fit of a set of functions F is defined as sup_{f∈F} |∑_{i=1}^n σ_i f(x_i)|. That is, we are fitting the particular training sample by finding the hypothesis in F with the best fit. If F is rich, it will be easier to find a hypothesis f ∈ F that matches the random labels, and its fit will be large. On the other hand, if F is simple (e.g., in the extreme containing only one function f), it is unlikely that f(x_i) will match σ_i, and its fit will be close to zero.

Finally, recall that {(x_i, σ_i)}_{i=1}^n is a particular random training sample. If, for every random training sample of size n, there always exists some f ∈ F (which may be different each time) that matches it, then F is very good at fitting random noise. This also means that F is prone to overfitting, whose very definition is to learn noise. This is captured by taking the expectation over training samples:

Definition 1 (Rademacher Complexity).
For a set of real-valued functions F with domain X, a distribution PX on X, and a size n, the Rademacher complexity R(F, X, PX, n) is

    R(F, X, PX, n) = E_{x,σ} [ sup_{f∈F} | (2/n) ∑_{i=1}^n σ_i f(x_i) | ],    (1)

where the expectation is over x = x_1, ..., x_n iid∼ PX, and σ = σ_1, ..., σ_n iid∼ Bernoulli(1/2, 1/2) with values ±1.

Rademacher complexity depends on the hypothesis space F, the domain X, the distribution on the domain PX, as well as the training sample size n. The size n is relevant because for a fixed F, it will be increasingly difficult to fit random noise as n gets larger. On the other hand, it is worth noting that Rademacher complexity is independent of any future classification tasks. For example, in Section 4 we will discuss two different tasks on the same X (set of words): classifying a word by its emotional valence, or by its length. These two tasks will share the same Rademacher complexity. In general, the value of Rademacher complexity will depend on the range of F. In the special case when F is a set of functions mapping x to {−1, 1}, R(F, X, PX, n) is between 0 and 2.

A particularly important property of Rademacher complexity is that it can be estimated from random samples. Let {(x_i^(1), σ_i^(1))}_{i=1}^n, ..., {(x_i^(m), σ_i^(m))}_{i=1}^n be m random samples of size n each. In Section 3, these will correspond to m different subjects. The following theorem is an extension of Theorem 11 in [1]. The proof follows from McDiarmid's inequality [16].

Theorem 1. Let F be a set of functions mapping to [−1, 1].
For any integers n, m,

    P{ | R(F, X, PX, n) − (1/m) ∑_{j=1}^m sup_{f∈F} | (2/n) ∑_{i=1}^n σ_i^(j) f(x_i^(j)) | | ≥ ε } ≤ 2 exp( −ε² n m / 8 ).    (2)

Theorem 1 allows us to estimate the expectation in (1) with random samples, which is of practical importance. It remains to compute the supremum in (1). In Section 3, we will discuss our procedure to approximate the supremum in the case of human learning.

Generalization Error Bound. We state a generalization error bound by Bartlett and Mendelson (Theorem 5 in [1]) as an important application of Rademacher complexity. Consider any binary classification task of predicting a label in Y = {−1, 1} for x ∈ X. For example, the label y could be the emotional valence (positive = 1 vs. negative = −1) of a word x. In general, a binary classification task is characterized by a joint distribution PXY on X × {−1, 1}. Training data and future test data consist of instance-label pairs (x, y) iid∼ PXY. Let F be a set of binary classifiers that map X to {−1, 1}. For f ∈ F, let 1(y ≠ f(x)) be an indicator function which is 1 if y ≠ f(x), and 0 otherwise. On a training sample {(x_i, y_i)}_{i=1}^n of size n, the observed training sample error of f is ê(f) = (1/n) ∑_{i=1}^n 1(y_i ≠ f(x_i)). The more interesting quantity is the true error of f, i.e., how well f can generalize to future test data: e(f) = E_{(x,y)∼PXY}[1(y ≠ f(x))]. Rademacher complexity allows us to bound the true error using training sample error as follows.

Theorem 2. (Bartlett and Mendelson) Let F be a set of functions mapping X to {−1, 1}.
Let PXY be a probability distribution on X × {−1, 1} with marginal distribution PX on X. Let {(x_i, y_i)}_{i=1}^n iid∼ PXY be a training sample of size n. For any δ > 0, with probability at least 1 − δ, every function f ∈ F satisfies

    e(f) − ê(f) ≤ R(F, X, PX, n)/2 + sqrt( ln(1/δ) / (2n) ).    (3)

The probability 1 − δ is over random draws of the training sample. That is, if one draws a large number of training samples of size n each, then (3) is expected to hold on a 1 − δ fraction of those samples. The bound has two factors, one from the Rademacher complexity and the other from the confidence parameter δ and training sample size n. When the bound is tight, training sample error is a good indicator of true error, and we can be confident that overfitting is unlikely. A tight bound requires the Rademacher complexity to be close to zero. On the other hand, if the Rademacher complexity is large, or n is too small, or the requested confidence 1 − δ is overly stringent, the bound can be loose. In that case, there is a danger of overfitting. We will demonstrate this generalization error bound on four different human classification tasks in Section 4.

3 Measuring Human Rademacher Complexity by Learning the Noise

Our aim is to measure the Rademacher complexity of the human learning system for a given stimulus space X, distribution of instances PX, and sample size n. By "human learning system," we mean the set of binary classification functions that an average human subject can come up with on the domain X, under the experiment conditions described below. We will denote this set of functions F with Ha, that is, "average human."

We make two assumptions. The first is the assumption of universality [2]: every individual has the same Ha. It allows us to pool subjects together.
This assumption can be loosened in the future. For instance, F could be defined as the set of functions that a particular individual or group can employ in the learning task, such as a given age-group, education level, or other special population. A second assumption is required to compute the supremum on Ha. Obviously, we cannot measure and compare the performance of every single function contained in Ha. Instead, we assume that, when making their classification judgments, participants use the best function at their disposal, so that the errors they make when tested on the training instances reflect the error generated by the best-performing function in Ha, thus providing a direct measure of the supremum. In essence, the assumption is that participants are doing their best to perform the task.

With the two assumptions above, we can compute human Rademacher complexity for a given stimulus domain X, distribution PX, and sample size n, by assessing how well human participants are able to learn randomly-assigned labels. Each participant is presented with a training sample {(x_i, σ_i)}_{i=1}^n where the σ's are random ±1 labels, and asked to learn the instance-label mapping. The subject is not told that the labels are random. We assume that the subject will search within Ha for the best hypothesis ("rule"), which is the one that minimizes training error: f* = argmax_{f∈Ha} ∑_{i=1}^n σ_i f(x_i) = argmin_{f∈Ha} ê(f). We do not directly observe f*. Later, we ask the subject to classify the same training instances {x_i}_{i=1}^n using what she has learned. Her classification labels will be f*(x_1), ..., f*(x_n), which we do observe.
We then approximate the supremum as follows: sup_{f∈Ha} |(2/n) ∑_{i=1}^n σ_i f(x_i)| ≈ |(2/n) ∑_{i=1}^n σ_i f*(x_i)|. For the measured Rademacher complexity to reflect actual learning capacity on the set Ha, it is important to prevent participants from simply doing rote learning. With these considerations, we propose the following procedure to estimate human Rademacher complexity.

Procedure. Given domain X, distribution PX, training sample size n, and number of subjects m, we generate m random samples of size n each: {(x_i^(1), σ_i^(1))}_{i=1}^n, ..., {(x_i^(m), σ_i^(m))}_{i=1}^n, where x_i^(j) iid∼ PX and σ_i^(j) iid∼ Bernoulli(1/2, 1/2) with value ±1, for j = 1 ... m. The procedure is paper-and-pencil based, and consists of three steps:

Step 1. Participant j is shown a printed sheet with the training sample {(x_i^(j), σ_i^(j))}_{i=1}^n, where each instance x_i^(j) is paired with its random label σ_i^(j) (shown as "A" and "B" instead of −1, 1 for convenience). The participant is informed that there are only two categories; the order does not matter; they have three minutes to study the sheet; and later they will be asked to use what they have learned to categorize more instances into "A" or "B".

Step 2. After three minutes the sheet is taken away. To prevent active maintenance of training items in working memory, the participant performs a filler task consisting of ten two-digit addition/subtraction questions.

Step 3. The participant is given another sheet with the same training instances {x_i^(j)}_{i=1}^n but no labels.
The order of the n instances is randomized and different from step 1. The participant is not told that they are the same training instances, and is asked to categorize each instance as "A" or "B", and is encouraged to guess if necessary. There is no time limit. Let f^(j)(x_1^(j)), ..., f^(j)(x_n^(j)) be subject j's answers (encoded as ±1). We estimate R(Ha, X, PX, n) by (1/m) ∑_{j=1}^m |(2/n) ∑_{i=1}^n σ_i^(j) f^(j)(x_i^(j))|. We also conduct a post-experiment interview where the subject reports any insights or hypotheses they may have on the categories.

Materials. To instantiate the general procedure, one needs to specify the domain X and an associated marginal distribution PX. For simplicity, in all our experiments PX is the uniform distribution over the corresponding domain. We conducted experiments on two example domains. They are not of specific interest in themselves but nicely illustrate many interesting properties of human Rademacher complexity: (1) The "Shape" Domain. X consists of 321 computer-generated 3D shapes [3]. The shapes are parametrized by a real number x ∈ [0, 1], such that small x produces spiky shapes, while large x produces smooth ones. A few instances and their parameters are shown in Figure 1(a). It is important to note that this internal structure is unnecessary to the definition or measurement of Rademacher complexity per se. However, in Section 4 we will define some classification tasks that utilize this internal structure. Participants have little existing knowledge about this domain. (2) The "Word" Domain. X consists of 321 English words. We start with the Wisconsin Perceptual Attribute Ratings Database [18], which contains words rated by 350 undergraduates for their emotional valence. We sort the words by their emotional valence, and take the 161 most positive and the 160 most negative ones for use in the study. A few instances and their emotion ratings are shown in Figure 1(b). Participants have rich knowledge about this domain.
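The estimation step of the procedure above has a direct machine analogue: replace Ha with an explicit hypothesis class and the paper-and-pencil supremum with an exhaustive search. A minimal sketch, assuming a hypothetical finite class of threshold classifiers on [0, 1] standing in for Ha (the class, grid, and sample counts are illustrative choices, not the paper's):

```python
import random

def empirical_rademacher(hypotheses, sample_x, n, m, seed=0):
    """Monte Carlo estimate of R(F, X, P_X, n) via the Theorem 1 average:
    (1/m) * sum_j  sup_{f in F} | (2/n) * sum_i sigma_i^(j) * f(x_i^(j)) |."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        xs = [sample_x(rng) for _ in range(n)]           # x_i iid ~ P_X
        sig = [rng.choice((-1, 1)) for _ in range(n)]    # sigma_i iid +/-1
        fit = max(abs(2.0 / n * sum(s * f(x) for s, x in zip(sig, xs)))
                  for f in hypotheses)                   # sup over finite F
        total += fit
    return total / m

# Hypothetical stand-in for Ha on the shape parameter x in [0, 1]:
# threshold classifiers at a grid of cutpoints, in both orientations.
thresholds = [lambda x, t=t, s=s: s * (1 if x >= t else -1)
              for t in [i / 20 for i in range(21)] for s in (-1, 1)]

uniform01 = lambda rng: rng.random()                     # P_X = uniform on [0, 1)
R5 = empirical_rademacher(thresholds, uniform01, n=5, m=1000)
R40 = empirical_rademacher(thresholds, uniform01, n=40, m=1000)
# Consistent with the definition, the estimate lies in [0, 2] and shrinks as n grows.
```

The same routine applied to human answers reduces to the paper's estimator, since each subject's f^(j) plays the role of the maximizing hypothesis.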
The size of the domain for shapes and words was matched to facilitate comparison.

Participants were 80 undergraduate students, participating for partial course credit. They were divided evenly into eight groups. Each group of m = 10 subjects worked on a unique combination of the Shape or the Word domain, and training sample size n in {5, 10, 20, 40}, using the procedure defined previously.

[Figure 1: Human Rademacher complexity on the "Shape" and "Word" domains. (a) examples from the Shape domain at parameters 0, 1/4, 1/2, 3/4, 1; (b) examples from the Word domain with emotion ratings: rape −5.60, killer −5.55, funeral −5.47, ..., fun 4.91, laughter 4.95, joy 5.19; (c) R(Ha, Shape, uniform, n); (d) R(Ha, Word, uniform, n).]

Results. Figures 1(c,d) show the measured human Rademacher complexities on the domains X = Shape and Word respectively, with distribution PX = uniform, and with different training sample sizes n. The error bars are 95% confidence intervals. Several interesting observations can be made from the data:

Observation 1: human Rademacher complexities in both domains decrease as n increases. This is anticipated, as it should be harder to learn a larger number of random labels. Indeed, when n = 5, our interviews show that, in both domains, 9 out of 10 participants offered some spurious rules for the random labels. For example, one subject thought the shape categories were determined by whether the shape "faces" downward; another thought the word categories indicated whether the word contains the letter T. Such beliefs, though helpful in learning the particular training samples, amount to over-fitting the noise.
In contrast, when n = 40, about half the participants indicated that they believed the labels to be random, as spurious "rules" are more difficult to find.

Observation 2: human Rademacher complexities are significantly higher in the Word domain than in the Shape domain, for n = 10, 20, 40 respectively (t-tests, p < 0.05). The higher complexity indicates that, for the same sample sizes, participants are better able to find spurious explanations of the training data for the Words than for the Shapes. Two distinct strategies were apparent in the Word domain interviews: (i) Some participants created mnemonics. For example, one subject received the training sample (grenade, B), (skull, A), (conflict, A), (meadow, B), (queen, B), and came up with the following story: "a queen was sitting in a meadow and then a grenade was thrown (B = before), then this started a conflict ending in bodies & skulls (A = after)." (ii) Other participants came up with idiosyncratic, but often imperfect, rules. For instance, whether the item "tastes good," "relates to motel service," or is "physical vs. abstract." We speculate that human Rademacher complexities on other domains can be drastically different too, reflecting the richness of the participants' pre-existing knowledge about the domain.

Observation 3: many of these human Rademacher complexities are relatively large. This means that under those X, PX, n, humans have a large capacity to learn arbitrary labels, and so will be more prone to overfit on real (i.e., non-random) tasks. We will present human generalization experiments in Section 4.
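The over-fitting risk flagged in Observation 3 is quantified by the size of the right-hand side of the Theorem 2 bound, which is a one-line computation. A minimal sketch; the R values below are illustrative assumptions chosen for the example, not the measured human complexities:

```python
import math

def rademacher_bound(e_hat, R, n, delta=0.05):
    """Upper bound on the true error e(f) from Theorem 2:
    e(f) <= e_hat(f) + R(F, X, P_X, n)/2 + sqrt(ln(1/delta) / (2n))."""
    return e_hat + R / 2.0 + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# A zero-training-error learner with an assumed complexity R = 1.6 at n = 5
# gets a near-vacuous bound:
print(round(rademacher_bound(0.0, 1.6, 5), 2))    # -> 1.35
# An assumed small complexity at n = 40 gives a usable guarantee:
print(round(rademacher_bound(0.05, 0.2, 40), 2))  # -> 0.34
```

Note that a large R or a small n loosens the bound additively, which is exactly the regime where over-fitting is a danger.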
It is also interesting to note that both Rademacher complexities at n = 5 are less than 2: under our procedure, participants are not perfect at remembering the labels of merely five instances.

4 Bounding Human Generalization Errors

Table 1: Human learning performance abides by the generalization error bounds.

condition        ID    ê     bound e    e
Shape-+ n=5      81   0.00    1.35    0.05
                 82   0.00    1.35    0.22
                 83   0.00    1.35    0.10
                 84   0.00    1.35    0.09
                 85   0.00    1.35    0.07
Shape-+ n=40     86   0.05    0.39    0.04
                 87   0.03    0.36    0.14
                 88   0.03    0.36    0.03
                 89   0.00    0.34    0.04
                 90   0.00    0.34    0.01
Shape-+- n=5     91   0.00    1.35    0.23
                 92   0.00    1.35    0.27
                 93   0.00    1.35    0.21
                 94   0.00    1.35    0.40
                 95   0.20    1.55    0.18
Shape-+- n=40    96   0.12    0.46    0.16
                 97   0.32    0.66    0.50
                 98   0.15    0.49    0.08
                 99   0.15    0.49    0.11
                100   0.03    0.36    0.10
WordEmotion n=5 101   0.00    1.43    0.58
                102   0.00    1.43    0.46
                103   0.00    1.43    0.04
                104   0.00    1.43    0.03
                105   0.00    1.43    0.31
WordEmotion n=40 106  0.70    1.23    0.65
                107   0.00    0.53    0.04
                108   0.00    0.53    0.00
                109   0.62    1.15    0.53
                110   0.00    0.53    0.05
WordLength n=5  111   0.00    1.43    0.46
                112   0.00    1.43    0.69
                113   0.00    1.43    0.55
                114   0.00    1.43    0.26
                115   0.00    1.43    0.57
WordLength n=40 116   0.12    0.65    0.51
                117   0.45    0.98    0.55
                118   0.00    0.53    0.00
                119   0.15    0.68    0.29
                120   0.15    0.68    0.37

We reiterate the interpretation of human Rademacher complexity for psychology. It does not predict ê (how well humans perform when training for a given task). Instead, Theorem 2 bounds e − ê, the "amount of overfitting" (sometimes also called "instability") when the subject switches from training to testing. A tight (close to 0) bound guarantees no severe overfitting: humans' future test
performance e will be close to their training performance ê. This does not mean they will do well: ê could be large and thus e is similarly large. A loose bound, in contrast, is a warning sign for overfitting: good training performance (small ê) may not reflect learning of the correct categorization rule, and so does not entail good performance on future samples (i.e., e can be much larger than ê). We now present four non-random category-learning tasks to illustrate these points.

Materials. We consider four very different binary classification tasks to assess whether Theorem 2 holds for all of them. The tasks are: (1) Shape-+: Recall the Shape domain is parametrized by x ∈ [0, 1]. The task has a linear decision boundary at x = 0.5, i.e., P(y = 1|x) = 0 if x < 0.5, and 1 if x ≥ 0.5. It is well known that people can easily learn such boundaries, so this is a fairly easy task on the domain. (2) Shape-+-: This task is also on the Shape domain, but with a nonlinear decision boundary. The negative class is on both ends while the positive class is in the middle: P(y = 1|x) = 0 if x ∈ [0, 0.25) ∪ (0.75, 1], and 1 if x ∈ [0.25, 0.75]. Prior research suggests that people have difficulty learning nonlinearly separable categories [28, 7], so this is a harder task. Note, however, that the two shape tasks share the same Rademacher complexity, and therefore have the same bound for the same n. (3) WordEmotion: This task is on the Word domain. P(y = 1|x) = 0 if word x has a negative emotion rating in the Wisconsin Perceptual Attribute Ratings Database, and P(y = 1|x) = 1 otherwise. (4) WordLength: P(y = 1|x) = 0 if word x has 5 or fewer letters, and P(y = 1|x) = 1 otherwise.
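Since all four P(y|x) above are deterministic, the target concepts can be written down directly. A sketch in {−1, 1} labels; the `rating` argument is a hypothetical stand-in for a lookup in the Wisconsin Perceptual Attribute Ratings Database, whose actual interface is not part of this paper:

```python
def shape_pm(x):
    """Shape-+: linear decision boundary at x = 0.5 on the shape parameter."""
    return 1 if x >= 0.5 else -1

def shape_pmp(x):
    """Shape-+-: nonlinear boundary; positive class in the middle [0.25, 0.75]."""
    return 1 if 0.25 <= x <= 0.75 else -1

def word_emotion(word, rating):
    """WordEmotion: sign of the word's emotion rating (rating is a
    hypothetical lookup, e.g. word_emotion("joy", 5.19) -> 1)."""
    return 1 if rating >= 0 else -1

def word_length(word):
    """WordLength: positive iff the word has more than 5 letters."""
    return 1 if len(word) > 5 else -1

print(shape_pm(0.3), shape_pmp(0.3), word_length("joy"), word_length("laughter"))
# -> -1 1 -1 1
```

As the paper notes, Shape-+ and Shape-+- (and likewise the two word tasks) share one Rademacher complexity: the labeling rule y plays no role in Definition 1.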
The two word tasks are drastically different in that one focuses on semantics and the other on orthography, but they share the same Rademacher complexity and thus the same bound (for the same n), because the underlying domain is the same.

Procedure. The procedure is identical to that in Section 3 except for two things: (i) Instead of random labels σ, we sample labels y iid∼ P(y|x) appropriate for each task. (ii) In step 3, in addition to the training instances {x_i^(j)}_{i=1}^n, the jth subject is also given 100 test instances {x_i^(j)}_{i=n+1}^{n+100}, sampled from PX. The order of the training and test instances is randomized. The true test labels y are sampled from P(y|x). We compute the participant's training sample error as ê(f^(j)) = (1/n) ∑_{i=1}^n 1(y_i ≠ f^(j)(x_i^(j))), and estimate her generalization error as e(f^(j)) = (1/100) ∑_{i=n+1}^{n+100} 1(y_i ≠ f^(j)(x_i^(j))).

Participants were 40 additional students, randomly divided into 8 groups of five each. Each group worked on one of the four tasks, with training sample size n = 5 or 40.

Results. We present the performance of individual participants in Table 1: ê is the observed training error for that subject, "bound e" is the 95% confidence (i.e., δ = 0.05) bound on test error: ê(f) + R(F, X, PX, n)/2 + sqrt(ln(1/δ)/(2n)), and e is the observed test error.

[Figure 2: Human Rademacher complexity predicts the trend of overfitting.]

We also present the aggregated results across subjects and tasks in Figure 2, comparing the bound on e − ê (the "amount of overfitting," RHS of (3)) vs. the observed e − ê, as the underlying Rademacher complexity varies. We make two more observations:

Observation 4: Theorem 2 holds for every participant.
Table 1 provides empirical support that our application of computational learning theory to human learning is viable. In fact, for our choice of δ = 0.05, Theorem 2 allows the bound to fail on about two (5% of 40) participants, which did not happen. Of course, some of the "bound e" values are vacuous (greater than 1), as it is well known that bounds in computational learning theory are not always tight [14], but others are reasonably tight (e.g., on Shape-+ with n = 40).

Observation 5: the larger the Rademacher complexity, the worse the actual amount of overfitting e − ê. Figure 2 shows that as R increases, e − ê increases (solid line; error bars ± standard error; averaged over the two different tasks with the same domain and n, as noted in the graph). The bound on e − ê (dotted line; RHS of (3)) has the same trend, although, being loose, it is higher up. This seems to be true regardless of the classification task. For example, the Word domain with n = 5 has a large Rademacher complexity of 1.76, and both task WordLength and task WordEmotion severely overfit: In task WordLength with n = 5, all subjects had zero training error but large test error, suggesting that their good performance on the training items reflects overfitting. Accordingly, the explanations offered during the post-test interviews for this group spuriously fit the training items but did not reflect the true categorization rule. Subject 111 thought that the class decision indicated "things you can go inside," while subject 114 thought the class indicated an odd or even number of syllables. Similarly, on task WordEmotion with n = 5, three out of five subjects overfit the training items.
Subject 102 received the training items (daylight, 1), (hospital, -1), (termite, -1), (envy, -1), (scream, -1), and concluded that class 1 is "anything related to omitting [sic] light," and proceeded to classify the test items as such.

5 Discussions and Future Work

Is our study on memory or learning? This distinction is not necessarily relevant here, as we adopt an abstract perspective which analyzes the human system as a black box that produces labels, and both learning and memory contribute to the process being executed in that black box. We do have evidence from post-interviews that Figure 1 does not merely reflect list-length effects from memory studies: (i) participants treated the study as a category-learning and not a memory task (they were not told that the training and test items are the same when we estimate R); (ii) the memory load was identical in the Shape and the Word domains, yet the curves differ markedly; (iii) participants were able to articulate the "rules" they were using to categorize the items.

Much recent research has explored the relationship between the statistical complexity of some categorization task and the ease with which humans learn the task [7, 5, 9, 11]. Rademacher complexity is different: it indexes not the complexity of the X → Y categorization task, but the sophistication of the learner in domain X (note that Y does not appear in Rademacher complexity). Greater complexity indicates not a more difficult categorization task, but a greater tendency to overfit sparse data. On the other hand, our definition of Rademacher complexity depends only on the domain, distribution, and sample size.
In human learning, other factors also contribute to learnability, such as the instructions, motivation, and time to study, and these should probably be incorporated into the complexity.

Human Rademacher complexity has interesting connections to other concepts. The VC-dimension [27, 25, 12] is another capacity measure. Let {x_1, ..., x_m} ⊆ X be a subset of the domain. Let (f(x_1), ..., f(x_m)) be a ±1-valued vector which is the classifications made by some f ∈ F. If F is rich enough such that its members can produce all 2^m vectors: {(f(x_1), ..., f(x_m)) : f ∈ F} = {−1, 1}^m, then we say that the subset is shattered by F. The VC-dimension of F is the size of the largest subset that can be shattered by F, or ∞ if F can shatter arbitrarily large subsets. Unfortunately, human VC-dimension seems difficult to measure experimentally: First, shattering requires validating an exponential (2^m) number of classifications on a given subset. Second, to determine that the VC-dimension is m, one needs to show that no subset of size m + 1 can be shattered; however, the number of such subsets can be infinite. A variant, "effective VC-dimension", may be empirically estimated from a training sample [26]; this is left for future research. Normalized Maximum Likelihood (NML) uses a similar complexity measure for a model class [21]; the connection merits further study ([23], p. 50).

Human Rademacher complexity might help to advance theories of human cognition in many ways. First, human Rademacher complexity can provide a means of testing computational models of human concept learning. Traditionally, such models are assessed by comparing their performance to human performance in terms of classification error. A new approach would be to derive or empirically estimate the Rademacher complexity of the computational models, and compare that to human Rademacher complexity.
A good computational model should match humans in this regard.

Second, our procedure could be used to measure human Rademacher complexity in individuals or special populations, including typically and atypically developing children and adults. Relating individual Rademacher complexity to standard measures of learning ability or IQ may shed light on the relationship between complexity, learning, and intelligence. Many IQ tests measure the participant's ability to generalize patterns in words or images; individuals with very high Rademacher complexity may actually perform worse by being "distracted" by other potential hypotheses.

Third, human Rademacher complexity may help explain the human tendency to discern patterns in random stimuli, as in the well-known Rorschach inkblot test, "illusory correlations" [4], or the "false-memory" effect [22]. These effects may be viewed as spurious rule-fitting to (or generalization of) the observed data, and human Rademacher complexity may quantify the likelihood of observing such an effect.

Fourth, cognitive psychologists have long entertained an interest in characterizing the capacity of different mental processes, such as the capacity limitations of short-term memory [19, 6]. In this vein, our work suggests a different kind of metric for assessing the capacity of the human learning system.

Finally, human Rademacher complexity can help experimental psychologists determine the propensity for overfitting in their stimulus materials. We have seen that human Rademacher complexity can be much higher in some domains (e.g., Word) than others (e.g., Shape).
Our procedure could be used to measure the human Rademacher complexity of many standard concept-learning materials in cognitive science, such as the Greebles used by Tarr and colleagues [8] and the circle-and-line stimuli of McKinley & Nosofsky [17].

Acknowledgment: We thank the reviewers for their helpful comments. XZ thanks Michael Coen for discussions that led to the realization of the difficulties in measuring human VC dimension. This work is supported in part by AFOSR grant FA9550-09-1-0313 and the Wisconsin Alumni Research Foundation.

References

[1] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[2] A. Caramazza and M. McCloskey. The case for single-patient studies. Cognitive Neuropsychology, 5(5):517–527, 1988.

[3] R. Castro, C. Kalish, R. Nowak, R. Qian, T. Rogers, and X. Zhu. Human active learning. In Advances in Neural Information Processing Systems (NIPS) 22. 2008.

[4] L. J. Chapman. Illusory correlation in observational report. Journal of Verbal Learning and Verbal Behavior, 6:151–155, 1967.

[5] N. Chater and P. Vitanyi. Simplicity: A unifying principle in cognitive science? Trends in Cognitive Science, 7(1):19–22, 2003.

[6] N. Cowan. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24:87–185, 2000.

[7] J. Feldman. Minimization of boolean complexity in human concept learning. Nature, 407:630–633, 2000.

[8] I. Gauthier and M. Tarr. Becoming a "greeble" expert: Exploring mechanisms for face recognition. Vision Research, 37(12):1673–1682, 1998.

[9] N. Goodman, J. B. Tenenbaum, J. Feldman, and T. L. Griffiths. A rational analysis of rule-based concept learning. Cognitive Science, 32(1):108–133, 2008.

[10] T. L. Griffiths, B. R.
Christian, and M. L. Kalish. Using category structures to test iterated learning as a method for identifying inductive biases. Cognitive Science, 32:68–107, 2008.

[11] T. L. Griffiths and J. B. Tenenbaum. From mere coincidences to meaningful discoveries. Cognition, 103(2):180–226, 2007.

[12] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[13] V. I. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. In E. Gine, D. Mason, and J. Wellner, editors, High Dimensional Probability II, pages 443–459. MIT Press, 2000.

[14] J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.

[15] R. L. Lewis. Interference in short-term memory: The magical number two (or three) in sentence processing. Journal of Psycholinguistic Research, 25(1):93–115, 1996.

[16] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148–188. Cambridge University Press, 1989.

[17] S. C. McKinley and R. M. Nosofsky. Selective attention and the formation of linear decision boundaries. Journal of Experimental Psychology: Human Perception & Performance, 22(2):294–317, 1996.

[18] D. Medler, A. Arnoldussen, J. Binder, and M. Seidenberg. The Wisconsin perceptual attribute ratings database, 2005. http://www.neuro.mcw.edu/ratings/.

[19] G. Miller. The magical number seven plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81–97, 1956.

[20] R. C. O'Reilly and J. L. McClelland. Hippocampal conjunctive encoding, storage, and recall: Avoiding a tradeoff. Hippocampus, 4:661–682, 1994.

[21] J. Rissanen.
Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47(5):1712–1717, 2001.

[22] H. L. Roediger and K. B. McDermott. Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory and Cognition, 21(4):803–814, 1995.

[23] T. Roos. Statistical and Information-Theoretic Methods for Data Analysis. PhD thesis, Department of Computer Science, University of Helsinki, 2007.

[24] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

[25] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

[26] V. Vapnik, E. Levin, and Y. Le Cun. Measuring the VC-dimension of a learning machine. Neural Computation, 6:851–876, 1994.

[27] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

[28] W. D. Wattenmaker. Knowledge structures and linear separability: Integrating information in object and social categorization. Cognitive Psychology, 28(3):274–328, 1995.

[29] W. D. Wattenmaker, G. I. Dewey, T. D. Murphy, and D. L. Medin. Linear separability and concept learning: Context, relational properties, and concept naturalness. Cognitive Psychology, 18(2):158–194, 1986.