{"title": "Learning From Small Samples: An Analysis of Simple Decision Heuristics", "book": "Advances in Neural Information Processing Systems", "page_first": 3159, "page_last": 3167, "abstract": "Simple decision heuristics are models of human and animal behavior that use few pieces of information---perhaps only a single piece of information---and integrate the pieces in simple ways, for example, by considering them sequentially, one at a time, or by giving them equal weight. It is unknown how quickly these heuristics can be learned from experience. We show, analytically and empirically, that only a few training samples lead to substantial progress in learning. We focus on three families of heuristics: single-cue decision making, lexicographic decision making, and tallying. Our empirical analysis is the most extensive to date, employing 63 natural data sets on diverse subjects.", "full_text": "Learning From Small Samples: An Analysis of\n\nSimple Decision Heuristics\n\n\u00a8Ozg\u00a8ur S\u00b8ims\u00b8ek and Marcus Buckmann\nCenter for Adaptive Behavior and Cognition\nMax Planck Institute for Human Development\n\nLentzeallee 94, 14195 Berlin, Germany\n\n{ozgur, buckmann}@mpib-berlin.mpg.de\n\nAbstract\n\nSimple decision heuristics are models of human and animal behavior that use few\npieces of information\u2014perhaps only a single piece of information\u2014and integrate\nthe pieces in simple ways, for example, by considering them sequentially, one at\na time, or by giving them equal weight. We focus on three families of heuristics:\nsingle-cue decision making, lexicographic decision making, and tallying.\nIt is\nunknown how quickly these heuristics can be learned from experience. We show,\nanalytically and empirically, that substantial progress in learning can be made with\njust a few training samples. When training samples are very few, tallying performs\nsubstantially better than the alternative methods tested. 
Our empirical analysis is\nthe most extensive to date, employing 63 natural data sets on diverse subjects.\n\n1\n\nIntroduction\n\nYou may remember that, on January 15, 2009, in New York City, a commercial passenger plane\nstruck a \ufb02ock of geese within two minutes of taking off from LaGuardia Airport. The plane imme-\ndiately and completely lost thrust from both engines, leaving the crew facing a number of critical\ndecisions, one of which was whether they could safely return to LaGuardia. The answer depended\non many factors, including the weight, velocity, and altitude of the aircraft, as well as wind speed and\ndirection. None of these factors, however, are directly involved in how pilots make such decisions.\nAs copilot Jeffrey Skiles discussed in a later interview [1], pilots instead use a single piece of visual\ninformation: whether the desired destination is staying stationary in the windshield. If the desti-\nnation is rising or descending, the plane will undershoot or overshoot the destination, respectively.\nUsing this visual cue, the \ufb02ight crew concluded that LaGuardia was out of reach, deciding instead\nto land on the Hudson River. Skiles reported that subsequent simulation experiments consistently\nshowed that the plane would indeed have crashed before reaching the airport.\nSimple decision heuristics, such as the one employed by the \ufb02ight crew, can provide effective solu-\ntions to complex problems [2, 3]. Some of these heuristics use a single piece of information; others\nuse multiple pieces of information but combine them in simple ways, for example, by considering\nthem sequentially, one at a time, or by giving them equal weight.\nOur work is concerned with two questions: How effective are simple decision heuristics? And\nhow quickly can they be learned from experience? 
We focus on problems of comparison, where the\nobjective is to decide which of a given set of objects has the highest value on an unobserved criterion.\nThese problems are of fundamental importance in intelligent behavior. Humans and animals spend\nmuch of their time choosing an object to act on, with respect to some criterion whose value is\nunobserved at the time. Choosing a mate, a prey to chase, an investment strategy for a retirement\nfund, or a publisher for a book are just a few examples. Earlier studies on this problem have shown\n\n1\n\n\fthat simple heuristics are surprisingly accurate in natural environments [4, 5, 6, 7, 8, 9], especially\nwhen learning from small samples [10, 11].\nWe present analytical and empirical results on three families of heuristics: lexicographic decision\nmaking, tallying, and single-cue decision making. Our empirical analysis is the most extensive\nto date, employing 63 natural environments on diverse subjects. Our main contributions are as\nfollows: (1) We present analytical results on the rate of learning heuristics from experience. (2) We\nshow that very few learning instances can yield effective heuristics. (3) We empirically investigate\nsingle-cue decision making and \ufb01nd that its performance is remarkable. (4) We \ufb01nd that the most\nrobust decision heuristic for small sample sizes is tallying. Collectively, our results have important\nimplications for developing more successful heuristics and for studying how well simple heuristics\ncapture human and animal decision making.\n\n2 Background\n\nThe comparison problem asks which of a given set of objects has the highest value on an unobserved\ncriterion, given a number of attributes of the objects. We focus on pairwise comparisons, where\nexactly two objects are being compared. We consider a decision to be accurate if it selects the object\nwith the higher criterion value (or either object if they are equal in criterion value). 
In the heuristics\nliterature, attributes are called cues; we will follow this custom when discussing heuristics.\nThe heuristics we consider decide by comparing the objects on one or more cues, asking which\nobject has the higher cue value. Importantly, they do not require the difference in cue value to be\nquantified. For example, if we use height of a person as a cue, we need to be able to determine which\nof two people is taller but we do not need to know the height of either person or the magnitude of the\ndifference. Each cue is associated with a direction of inference, also known as cue direction, which\ncan be positive or negative, favoring the object with the higher or lower cue value, respectively. Cue\ndirections (and other components of heuristics) can be learned in a number of ways, including social\nlearning. In our analysis, we learn them from training examples.\nSingle-cue decision making is perhaps the simplest decision method one can imagine. It compares\nthe objects on a single cue, breaking ties randomly. We learn the identity of the cue and its direction\nfrom a training sample. Among the 2k possible models, where k is the number of cues, we choose\nthe \u27e8cue, direction\u27e9 combination that has the highest accuracy in the training sample, breaking ties\nrandomly.\nLexicographic heuristics consider the cues one at a time, in a specified order, until they find a cue\nthat discriminates between the objects, that is, one whose value differs on the two objects. The\nheuristic then decides based on that cue alone. An example is take-the-best [12], which orders cues\nwith respect to decreasing validity on the training sample, where validity is the accuracy of the cue\namong pairwise comparisons on which the cue discriminates between the objects.\nTallying is a voting model. 
It determines how each cue votes on its own (selecting one or the other\nobject or abstaining from voting) and selects the object with the highest number of votes, breaking\nties randomly. We set cue directions to the direction with highest validity in the training set.\nPaired comparison can also be formulated as a classification problem. Let yA denote the criterion\nvalue of object A, xA the vector of attribute values of object A, and \u2206yAB = yA \u2212 yB the difference\nin criterion values of objects A and B. We can define the class f of a pair of objects as a function of\nthe difference in their criterion values:\n\nf (\u2206yAB) = { 1 if \u2206yAB > 0; \u22121 if \u2206yAB < 0; 0 if \u2206yAB = 0 }\n\nA class value of 1 denotes that object A has the higher criterion value, \u22121 that object B has the\nhigher criterion value, and 0 that the objects are equal in criterion value. The comparison problem\nis intrinsically symmetrical: comparing A to B should give us the same decision as comparing B to\nA. That is, f (\u2206yAB) should equal \u2212f (\u2206yBA). Because the latter equals \u2212f (\u2212\u2206yAB), we have\nthe following symmetry constraint: f (z) = \u2212f (\u2212z), for all z. We can expect better classification\naccuracy if we impose this symmetry constraint on our classifier.\n\n3 Building blocks of decision heuristics\n\nWe first examine two building blocks of learning heuristics from experience: assigning cue direction\nand determining which of two cues has the higher predictive accuracy. The former is important for\nall three families of heuristics whereas the latter is important for lexicographic heuristics when\ndetermining which cue should be placed first. 
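As a concrete illustration of the heuristic families described above (our own sketch, not the authors' implementation), each pairwise comparison can be encoded as a row of sgn(xA \u2212 xB) values, one per cue, together with the class sgn(yA \u2212 yB); a returned value of 1 or \u22121 says which object is chosen:

```python
import numpy as np

rng = np.random.default_rng(0)

def validity(x, y):
    """Accuracy of a signed cue among the pairwise comparisons it
    discriminates: x and y hold sgn(xA - xB) and sgn(yA - yB)."""
    m = x != 0
    return np.mean(x[m] == y[m]) if m.any() else 0.5

def fit_directions(X, y):
    """Give each cue the direction (+1 or -1) with higher training validity."""
    return np.array([1 if validity(X[:, j], y) >= validity(-X[:, j], y) else -1
                     for j in range(X.shape[1])])

def take_the_best(X, y, test):
    """Lexicographic decision: the first discriminating cue, in decreasing
    validity order, decides; guess at random if no cue discriminates."""
    d = fit_directions(X, y)
    vals = [validity(d[j] * X[:, j], y) for j in range(X.shape[1])]
    for j in np.argsort(vals)[::-1]:
        if test[j] != 0:
            return d[j] * np.sign(test[j])
    return rng.choice([1, -1])

def tallying(X, y, test):
    """Each cue casts one vote in its fitted direction; the majority wins,
    with ties broken at random."""
    d = fit_directions(X, y)
    votes = int(np.sum(d * np.sign(test)))
    return np.sign(votes) if votes != 0 else rng.choice([1, -1])
```

Single-cue decision making corresponds to keeping only the top cue of this ordering (the paper selects it by training accuracy, with ties broken at random); the encoding and helper names here are ours.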
Both components are building blocks of heuristics in\na broader sense\u2014their use is not limited to the three families of heuristics considered here.\nLet A and B be the objects being compared, xA and xB denote their cue values, yA and yB denote\ntheir criterion values, and sgn denote the mathematical sign function: sgn(x) is 1 if x > 0, 0 if\nx = 0, and \u22121 if x < 0. A single training instance is the tuple \u27e8sgn(xA \u2212 xB), sgn(yA \u2212 yB)\u27e9,\ncorresponding to a single pairwise comparison, indicating whether the cue and the criterion change\nfrom one object to the other, along with the direction of change. For example, if xA = 1, yA = 10,\nxB = 2, yB = 5, the training instance is \u27e8\u22121, +1\u27e9.\nLearning cue direction. We assume, without loss of generality, that cue direction in the population\nis positive (we ignore the case where the cue direction in the population is neutral). Let p denote the\nsuccess rate of the cue in the population, where success is the event that the cue decides correctly.\nWe examine two probabilities, e1 and e2. The former is the probability of correctly inferring the cue\ndirection from a set of training instances. The latter is the probability of deciding correctly on a new\n(unseen) instance using the direction inferred from the training instances.\nWe define an informative instance to be one in which the objects differ both in their cue values and\nin their criterion values, a positive instance to be one in which the cue and the criterion change in\nthe same direction (\u27e81, 1\u27e9 or \u27e8\u22121, \u22121\u27e9), and a negative instance to be one in which the cue and the\ncriterion change in the opposite direction (\u27e81, \u22121\u27e9 or \u27e8\u22121, 1\u27e9).\nLet n be the number of training instances, n+ the number of positive training instances, and n\u2212\nthe number of negative training instances. 
Our estimate of cue direction is positive if n+ > n\u2212,\nnegative if n+ < n\u2212, and a random choice between positive and negative if n+ = n\u2212.\nGiven a set of independent, informative training instances, n+ follows the binomial distribution with\nn trials and success probability p, allowing us to write e1 as follows:\n\ne1 = P(n+ > n\u2212) + (1/2) P(n+ = n\u2212)\n   = \u2211_{k=\u230an/2\u230b+1}^{n} C(n, k) p^k (1 \u2212 p)^{n\u2212k} + I(n is even) (1/2) C(n, n/2) p^{n/2} (1 \u2212 p)^{n/2},\n\nwhere C(n, k) is the binomial coefficient and I is the indicator function. After one training instance,\ne1 equals p. After one more instance, e1 remains the same. This is a general property: After an odd\nnumber of training instances, an additional instance does not increase the probability of inferring\nthe direction correctly.\nOn a new (test) instance, the cue decides correctly with probability p if cue direction is inferred\ncorrectly and with probability 1 \u2212 p otherwise. Consequently, e2 = pe1 + (1 \u2212 p)(1 \u2212 e1).\nSimple algebra yields the following expected learning rates: After 2k + 1 training instances, with\ntwo additional instances, the increase in the probability of inferring cue direction correctly is\n(2p \u2212 1)(p(1 \u2212 p))^{k+1} and the increase in the probability of deciding correctly is\n(2p \u2212 1)^2 (p(1 \u2212 p))^{k+1}.\nFigure 1 shows e1 and e2 as a function of training-set size n and success rate p. The more predictive\nthe cue is, the smaller the sample needs to be for a desired level of accuracy in both e1 and e2. This is\nof course a desirable property: The more useful the cue is, the faster we learn how to use it correctly.\nThe figure also shows that there are highly diminishing returns, from one odd training-set size to the\nnext, as the size of the training set increases. In fact, just a few instances make great progress toward\nthe maximum possible. 
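The expressions for e1 and e2 are straightforward to evaluate numerically; the following sketch (ours, not the authors' code) mirrors the formula above:

```python
from math import comb, floor

def e1(n, p):
    """Probability of inferring the cue direction correctly from n
    independent, informative training instances; a tie between positive
    and negative counts is resolved by a fair coin flip."""
    total = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                for k in range(floor(n / 2) + 1, n + 1))
    if n % 2 == 0:  # ties are possible only for even n
        total += 0.5 * comb(n, n // 2) * p**(n // 2) * (1 - p)**(n // 2)
    return total

def e2(n, p):
    """Probability of deciding a new instance correctly with the
    inferred direction."""
    return p * e1(n, p) + (1 - p) * (1 - e1(n, p))
```

For p = 0.7, for example, e1 is 0.7 after one or two instances and 0.784 after three, matching the increment (2p \u2212 1)(p(1 \u2212 p))^{k+1} with k = 0.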
The third plot in the figure reveals this property more clearly. It shows e2\ndivided by its maximum possible value (p), showing how quickly we reach the maximum possible\naccuracy for cues of various predictive ability. The minimum value depicted in this figure is 0.83,\nobserved at n = 1. This means that even after a single training instance, our expected accuracy is at\nleast 83% of the maximum accuracy we can reach. And this value rises quickly with each additional\npair of training instances.\n\nFigure 1: Learning cue direction.\n\nLearning cue order. Let p and q denote the success rates of the two cues in the population, with\np > q. We expand the definition of informative instance to require that the objects differ on\nthe second cue as well. We examine two probabilities, e3 and e4. The former is the probability of\nordering the two cues correctly, which means placing the cue with higher success rate above the\nother one. The latter is the probability of deciding correctly with the inferred order. We chose to\nexamine learning to order cues independently of learning cue directions. One reason is that people\ndo not necessarily learn the cue directions from experience. In many cases, they can guess the cue\ndirection correctly through causal reasoning, social learning, past experience in similar problems, or\nother means. In the analysis below, we assume that the directions are assigned correctly.\nLet s1 and s2 be the success rates of the two cues in the training set. If instances are informative and\nindependent, s1 and s2 follow the binomial distribution with parameters (n, p) and (n, q), allowing\nus to write e3 as follows:\n\ne3 = P(s1 > s2) + (1/2) P(s1 = s2)\n   = \u2211_{0 \u2264 j < i \u2264 n} P(s1 = i) P(s2 = j) + (1/2) \u2211_{i=0}^{n} P(s1 = i) P(s2 = i).\n\nAfter one training instance, e3 is 0.5 + 0.5(p \u2212 q), which is a linear function of the difference between\nthe two success rates.\nIf we order cues correctly, a decision on a test instance is correct with probability p, otherwise with\nprobability q. 
Thus, e4 = pe3 + q(1 \u2212 e3).\nFigure 2 shows e3 and e4 as a function of p and q after three training instances. In general, larger\nvalues of p, as well as larger differences between p and q, require smaller training sets for a desired\nlevel of accuracy. In other words, learning progresses faster where it is more useful. The third plot\nin the figure shows e4 relative to the maximum value it can take, the maximum of p and q. The\nminimum value depicted in this figure is 90.9%. If we examine the same figure after only a single\ntraining instance, we see that this minimum value is 86.6% (figure not shown).\n\nFigure 2: Learning cue order. [Panels: probability of correctly ordering (e3), probability of\ncorrectly deciding (e4), and e4 / max(p, q), each as a function of p and q.]\n\n4 Empirical analysis\n\nWe next present an empirical analysis of 63 natural data sets, most from two earlier studies [4, 13].\nOur primary objective is to examine the empirical learning rates of heuristics. From the analytical\nresults of the preceding section, we expect learning to progress rapidly. A secondary objective is to\nexamine the effectiveness of different ways cues can be ordered in a lexicographic heuristic.\nThe data sets were gathered from a wide variety of sources, including online data repositories,\ntextbooks, packages for R statistical software, statistics and data mining competitions, research\npublications, and individual scientists collecting field data. 
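(The ordering probabilities e3 and e4 from Section 3 can be evaluated numerically in the same spirit as e1 and e2; the sketch below is an illustration of those formulas, not the authors' code.)

```python
from math import comb

def pmf(n, p, i):
    """Binomial probability of i successes in n informative instances."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

def e3(n, p, q):
    """Probability of ranking the stronger cue (success rate p) above the
    weaker one (success rate q < p), ties broken at random."""
    greater = sum(pmf(n, p, i) * pmf(n, q, j)
                  for i in range(n + 1) for j in range(i))
    equal = sum(pmf(n, p, i) * pmf(n, q, i) for i in range(n + 1))
    return greater + 0.5 * equal

def e4(n, p, q):
    """Probability of deciding a test instance correctly with the inferred
    order, assuming cue directions are assigned correctly."""
    return p * e3(n, p, q) + q * (1 - e3(n, p, q))
```

After one training instance, e3(1, p, q) reduces to 0.5 + 0.5(p \u2212 q), as stated above.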
The subjects were diverse, including biology,\nbusiness, computer science, ecology, economics, education, engineering, environmental science,\nmedicine, political science, psychology, sociology, sports, and transportation. The data sets varied\nin size, ranging from 13 to 601 objects. Many of the smaller data sets contained the entirety of the\npopulation of objects, for example, all 29 islands in the Gal\u00e1pagos archipelago. The data sets are\ndescribed in detail in the supplementary material.\nWe present results on lexicographic heuristics, tallying, single-cue decision making, logistic\nregression, and decision trees trained by CART [14]. We used the CART implementation in rpart [15]\nwith the default splitting criterion Gini, cp=0, minsplit=2, minbucket=1, and 10-fold cross-validated\ncost-complexity pruning. There is no explicit way to implement the symmetry constraint for\ndecision trees; we simply augmented the training set with its mirror image with respect to the direction\nof comparison. For logistic regression, we used the glm function of R, setting the intercept to zero\nto implement the symmetry constraint. To the glm function, we input the cues in the order of\ndecreasing correlation with the criterion so that the weakest cues were dropped first when the number\nof training instances was smaller than the number of cues.\nOrdering cues in lexicographic heuristics. We first examine the different ways lexicographic\nheuristics can order the cues. With k cues, there are k! possible cue orders. Combined with the\npossibility of using each cue with a positive or negative direction, there are 2^k k! possible\nlexicographic models, a number that increases very rapidly with k. How should we choose one if our top\ncriterion is accuracy but we also want to pay attention to computational cost and memory requirements?\nWe consider three methods. 
The first is a greedy search, where we start by deciding on the first\ncue to be used (along with its direction), then the second, and so on, until we have a fully specified\nlexicographic model. When deciding on the first cue, we select the one that has the highest validity\nin the training examples. When deciding on the mth cue, m \u2265 2, we select the cue that has the\nhighest validity in the examples left over after using the first m \u2212 1 cues, that is, those examples\nwhere the first m \u2212 1 cues did not discriminate between the two objects. The second method is to\norder cues with respect to their validity in the training examples, as take-the-best does. Evaluating\ncues independently of each other substantially reduces computational and memory requirements but\nperhaps at the expense of accuracy. The third method is to use the lexicographic model\u2014among the\n2^k k! possibilities\u2014that gives the highest accuracy in the training examples. Identifying this rule is\nNP-complete [16, 17], and it is unlikely to generalize well, but it will be informative to examine it.\nThe three methods have been compared earlier [18] on a data set consisting of German cities [12],\nwhere the fitting accuracy of the best, greedy, validity, and random ordering was 0.758, 0.756, 0.742,\nand 0.700, respectively.\nFigure 3 (top panel) shows the fitting accuracy of each method in each of the 63 data sets when all\npossible pairwise comparisons were conducted among all objects. Because of the long simulation\ntime required, we show an approximation of the best ordering in data sets with seven or more cues.\nIn these data sets, we started with the two lexicographic rules generated by the greedy and the\nvalidity ordering, kept intact the cues that were placed seventh or later in the sequence, and tested\nall possible permutations of their first six cues, trying out both possible cue directions. 
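The greedy construction described above can be sketched as follows (an illustrative implementation under our own encoding of pairwise comparisons as rows of signed cue differences, not the authors' code):

```python
import numpy as np

def cue_validity(x, y):
    """Accuracy of a signed cue among the pairs it discriminates."""
    m = x != 0
    return np.mean(x[m] == y[m]) if m.any() else 0.0

def greedy_order(X, y):
    """Greedily build a lexicographic rule: repeatedly pick the
    <cue, direction> with highest validity on the still-undecided pairs,
    then restrict attention to the pairs that cue did not discriminate."""
    n, k = X.shape
    undecided = np.ones(n, dtype=bool)
    order = []
    remaining = set(range(k))
    while remaining and undecided.any():
        j, d = max(((j, d) for j in remaining for d in (1, -1)),
                   key=lambda jd: cue_validity(jd[1] * X[undecided, jd[0]],
                                               y[undecided]))
        order.append((j, d))
        remaining.remove(j)
        undecided &= (X[:, j] == 0)  # pairs this cue left undiscriminated
    return order
```

Validity ordering, by contrast, would score every cue once on the full training set, which is cheaper but ignores the interaction between cues noted above.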
The figure\nalso shows the mean accuracy of random ordering, where cues were used in the direction of higher\nvalidity. In all data sets, greedy ordering was identical or very close in accuracy to the best ordering.\nIn addition, validity ordering was very close to greedy ordering except in a handful of data sets.\nOne explanation is that a continuous cue that is placed first in a lexicographic model makes all (or\nalmost all) decisions and therefore the order of the remaining cues does not matter. We therefore\nalso examine the binary version of each data set where numerical cues were dichotomized around\nthe median (Figure 3 bottom panel). There was little difference in the relative positions of greedy\nand optimal ordering except in one data set. There was more of a drop in the relative accuracy of\nthe validity ordering, but this method still achieved accuracy close to that of the best ordering in the\nmajority of the data sets.\n\nFigure 3: Fitting accuracy of lexicographic models, with and without dichotomizing the cues.\n\nWe next examine predictive accuracy. Figure 4 shows accuracies when the models were trained on\n50% of the objects and tested on the remaining 50%, conducting all possible pairwise comparisons\nwithin each group. Mean accuracy across data sets was 0.747 for logistic regression, 0.746 for\nCART, 0.743 for greedy lexicographic and take-the-best, 0.734 for single-cue, and 0.725 for tallying.\nFigure 5 shows learning curves, where we grew the training set one pairwise comparison at a time.\nTwo individual objects provided a single instance for training or testing and were never used again,\nneither in training nor in testing. Consequently, the training instances were independent of each\nother but they were not always informative (as defined in Section 3). The figure shows the mean\nlearning curve across all data sets as well as individual learning curves on 16 data sets. 
We present\nthe graphs without error bars for legibility; the highest standard error of the data points displayed is\n0.0014 in Figure 4 and 0.0026 in Figure 5.\nA few observations are noteworthy: (1) Heuristics were indeed learned rapidly. (2) In the early part\nof the learning curve, tallying generally had the highest accuracy. (3) The performance of single-cue\nwas remarkable. When trained on 50% of the objects, its mean performance was better than tallying,\n0.9 percentage points behind take-the-best, and 1.3 percentage points behind logistic regression. (4)\nTake-the-best performed better than or as well as greedy lexicographic in most data sets. A detailed\ncomparison of the two methods is provided below.\nValidity versus greedy ordering in lexicographic decision making. The learning curves on in-\ndividual data sets took one of four forms: (1) There was no difference in any part of the learning\ncurve. This is the case when a continuous cue is placed \ufb01rst: This cue almost always discriminates\nbetween the objects, and cues further down in the sequence are seldom (if ever) used. Because\ngreedy and validity ordering always agree on the \ufb01rst cue, the learning curves are identical or nearly\nso. Twenty-two data sets were in this \ufb01rst category. (2) Validity ordering was better than greedy\nordering in some parts of the learning curve and never worse. This category included 35 data sets.\n(3) Learning curves crossed: Validity ordering generally started with higher accuracy than greedy\nordering; the difference diminished with increasing training-set size, and eventually greedy ordering\nexceeded validity ordering in accuracy (2 data sets). 
(4) Greedy ordering was better than validity ordering in some parts of the learning curve and never\nworse (4 data sets). To draw these conclusions, we considered a difference to be present if the error\nbars (\u00b1 2 SE) did not overlap.\n\nFigure 4: Predictive accuracy when models are trained with 50% of the objects in each data set and\ntested on the remaining 50%.\n\n5 Discussion\n\nWe isolated two building blocks of decision heuristics and showed analytically that they require very\nfew training instances to learn under conditions that matter the most: when they add value to the\nultimate predictive ability of the heuristic. Our empirical analysis confirmed that heuristics typically\nmake substantial progress early in learning.\nAmong the algorithms we considered, the most robust method for very small training sets is tallying.\nEarlier work [11] concluded that take-the-best (with undichotomized cues) is the most robust model\nfor training sets with 3 to 10 objects but tallying (with undichotomized cues) was absent from this\nearlier study. In addition, we found that the performance of single-cue decision making is truly\nremarkable. This heuristic has been analyzed [19] by assuming that the cues and the criterion follow\nthe normal distribution; we are not aware of an earlier analysis of its empirical performance on\nnatural data sets.\nOur analysis of learning curves differs from earlier studies. 
Most earlier studies [20, 10, 21, 11,\n22] examined performance as a function of number of objects in the training set, where training\ninstances are all possible pairwise comparisons among those objects. Others increased the training\nset one pairwise comparison at a time but did not keep the pairwise comparisons independent of each\nother [23]. In contrast, we increased the training set one pairwise comparison at a time and kept all\npairwise comparisons independent of each other. This makes it possible to examine the incremental\nvalue of each training instance.\nThere is criticism of decision heuristics because of their computational requirements. For instance, it\nhas been argued that take-the-best can be described as a simple algorithm but its successful execution\nrelies on a large amount of precomputation [24] and that the computation of cue validity in the\nGerman city task \u201cwould require 30,627 pairwise comparisons just to establish the cue validity\nhierarchy for predicting city size\u201d [25]. Our results clearly show that the actual computational needs\nof heuristics can be very low if independent pairwise comparisons are used for training. 
A similar\nresult\u2014that just a few samples may suffice\u2014exists within the context of Bayesian inference [26].\n\nAcknowledgments\n\nThanks to Gerd Gigerenzer, Konstantinos Katsikopoulos, Malte Lichtenberg, Laura Martignon,\nPerke Jacobs, and the ABC Research Group for their comments on earlier drafts of this article.\nThis work was supported by Grant SI 1732/1-1 to \u00d6zg\u00fcr \u015eim\u015fek from the Deutsche Forschungsge-\nmeinschaft (DFG) as part of the priority program \u201cNew Frameworks of Rationality\u201d (SPP 1516).\n\nFigure 5: Learning curves. [Mean learning curve across all data sets, plus individual learning curves\non 16 data sets (Diamond, Mileage, Fish, Salary, Land, CPU, Obesity, Hitter, Pitcher, Car, Bodyfat,\nLake, Infant, Contraception, City, and Athlete), for take-the-best, greedy lexicographic, single-cue,\nlogistic regression, CART, and tallying.]\n\nReferences\n\n[1] C. Rose. Charlie Rose. Television program aired on February 10, 2009.\n\n[2] G. Gigerenzer, P. M. Todd, and the ABC Research Group. Simple heuristics that make us smart. Oxford\nUniversity Press, New York, 1999.\n\n[3] G. Gigerenzer, R. Hertwig, and T. Pachur, editors. Heuristics: The foundations of adaptive behavior.\nOxford University Press, New York, 2011.\n\n[4] J. Czerlinski, G. Gigerenzer, and D. G. Goldstein. How good are simple heuristics?, pages 97\u2013118. In\n[2], 1999.\n\n[5] L. Martignon and K. B. Laskey. Bayesian benchmarks for fast and frugal heuristics, pages 169\u2013188. In\n[2], 1999.\n\n[6] L. Martignon, K. V. Katsikopoulos, and J. K. Woike. Categorization with limited resources: A family of\nsimple heuristics. Journal of Mathematical Psychology, 52(6):352\u2013361, 2008.\n\n[7] S. Luan, L. Schooler, and G. Gigerenzer. From perception to preference and on to inference: An\napproach\u2013avoidance analysis of thresholds. 
Psychological Review, 121(3):501, 2014.\n\n[8] K. V. Katsikopoulos. Psychological heuristics for making inferences: Definition, performance, and the\nemerging theory and practice. Decision Analysis, 8(1):10\u201329, 2011.\n\n[9] K. B. Laskey and L. Martignon. Comparing fast and frugal trees and Bayesian networks for risk\nassessment. In K. Makar, editor, Proceedings of the 9th International Conference on Teaching Statistics,\nFlagstaff, Arizona, 2014.\n\n[10] H. Brighton. Robust inference with simple cognitive models. In C. Lebiere and B. Wray, editors, AAAI\nspring symposium: Cognitive science principles meet AI-hard problems, pages 17\u201322. American\nAssociation for Artificial Intelligence, Menlo Park, CA, 2006.\n\n[11] K. V. Katsikopoulos, L. J. Schooler, and R. Hertwig. The robust beauty of ordinary information.\nPsychological Review, 117(4):1259\u20131266, 2010.\n\n[12] G. Gigerenzer and D. G. Goldstein. Reasoning the fast and frugal way: Models of bounded rationality.\nPsychological Review, 103(4):650\u2013669, 1996.\n\n[13] \u00d6. \u015eim\u015fek. Linear decision rule as aspiration for simple decision heuristics. In C. J. C. Burges, L. Bottou,\nM. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing\nSystems 26, pages 2904\u20132912. Curran Associates, Inc., Red Hook, NY, 2013.\n\n[14] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC Press,\nBoca Raton, FL, 1984.\n\n[15] T. Therneau, B. Atkinson, and B. Ripley. rpart: Recursive partitioning and regression trees, 2014. R\npackage version 4.1-5.\n\n[16] M. Schmitt and L. Martignon. On the accuracy of bounded rationality: How far from optimal is fast and\nfrugal? In Y. Weiss, B. Sch\u00f6lkopf, and J. C. Platt, editors, Advances in Neural Information Processing\nSystems 18, pages 1177\u20131184. MIT Press, Cambridge, MA, 2006.\n\n[17] M. Schmitt and L. Martignon. 
On the complexity of learning lexicographic strategies. Journal of Machine\n\nLearning Research, 7:55\u201383, 2006.\n\n[18] L. Martignon and U. Hoffrage. Fast, frugal, and \ufb01t: Simple heuristics for paired comparison. Theory and\n\nDecision, 52(1):29\u201371, 2002.\n\n[19] R. M. Hogarth and N. Karelaia. Ignoring information in binary choice with continuous variables: When\n\nis less \u201cmore\u201d? Journal of Mathematical Psychology, 49(2):115\u2013124, 2005.\n\n[20] N. Chater, M. Oaksford, R. Nakisa, and M. Redington. Fast, frugal, and rational: How rational norms\n\nexplain behavior. Organizational Behavior and Human Decision Processes, 90(1):63\u201386, 2003.\n\n[21] H. Brighton and G. Gigerenzer. Bayesian brains and cognitive mechanisms: Harmony or dissonance? In\nN. Chater and M. Oaksford, editors, The probabilistic mind: Prospects for Bayesian cognitive science,\npages 189\u2013208. Oxford University Press, New York, 2008.\n\n[22] H. Brighton and G. Gigerenzer. Are rational actor models \u201crational\u201d outside small worlds? In S. Okasha\nand K. Binmore, editors, Evolution and rationality: Decisions, co-operation, and strategic behaviour,\npages 84\u2013109. Cambridge University Press, Cambridge, 2012.\n\n[23] P. M. Todd and A. Dieckmann. Heuristics for ordering cue search in decision making. In L. K. Saul,\nY. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1393\u2013\n1400. MIT Press, Cambridge, MA, 2005.\n\n[24] B. R. Newell. Re-visions of rationality? Trends in Cognitive Sciences, 9(1):11\u201315, 2005.\n[25] M. R. Dougherty, A. M. Franco-Watkins, and R. Thomas. Psychological plausibility of the theory of\nprobabilistic mental models and the fast and frugal heuristics. Psychological Review, 115(1):199\u2013213,\n2008.\n\n[26] E. Vul, N. Goodman, T. L. Grif\ufb01ths, and J. B. Tenenbaum. One and done? Optimal decisions from very\n\nfew samples. 
Cognitive Science, 38(4):599\u2013637, 2014.\n", "award": [], "sourceid": 1764, "authors": [{"given_name": "\u00d6zg\u00fcr", "family_name": "\u015eim\u015fek", "institution": "Max Planck Institute for Human Development"}, {"given_name": "Marcus", "family_name": "Buckmann", "institution": "Max Planck Institute for Human Development"}]}