{"title": "The Human Kernel", "book": "Advances in Neural Information Processing Systems", "page_first": 2854, "page_last": 2862, "abstract": "Bayesian nonparametric models, such as Gaussian processes, provide a compelling framework for automatic statistical modelling: these models have a high degree of flexibility, and automatically calibrated complexity. However, automating human expertise remains elusive; for example, Gaussian processes with standard kernels struggle on function extrapolation problems that are trivial for human learners. In this paper, we create function extrapolation problems and acquire human responses, and then design a kernel learning framework to reverse engineer the inductive biases of human learners across a set of behavioral experiments. We use the learned kernels to gain psychological insights and to extrapolate in human-like ways that go beyond traditional stationary and polynomial kernels. Finally, we investigate Occam's razor in human and Gaussian process based function learning.", "full_text": "The Human Kernel\n\nAndrew Gordon Wilson\n\nCMU\n\nChristoph Dann\n\nCMU\n\nChristopher G. Lucas\nUniversity of Edinburgh\n\nEric P. Xing\n\nCMU\n\nAbstract\n\nBayesian nonparametric models, such as Gaussian processes, provide a com-\npelling framework for automatic statistical modelling: these models have a high\ndegree of \ufb02exibility, and automatically calibrated complexity. However, automat-\ning human expertise remains elusive; for example, Gaussian processes with stan-\ndard kernels struggle on function extrapolation problems that are trivial for human\nlearners. In this paper, we create function extrapolation problems and acquire hu-\nman responses, and then design a kernel learning framework to reverse engineer\nthe inductive biases of human learners across a set of behavioral experiments. 
We\nuse the learned kernels to gain psychological insights and to extrapolate in human-\nlike ways that go beyond traditional stationary and polynomial kernels. Finally, we\ninvestigate Occam\u2019s razor in human and Gaussian process based function learning.\n\nIntroduction\n\n1\nTruly intelligent systems can learn and make decisions without human intervention. Therefore it\nis not surprising that early machine learning efforts, such as the perceptron, have been neurally\ninspired [1]. In recent years, probabilistic modelling has become a cornerstone of machine learning\napproaches [2, 3, 4], with applications in neural processing [5, 6, 3, 7] and human learning [8, 9].\nFrom a probabilistic perspective, the ability for a model to automatically discover patterns and per-\nform extrapolation is determined by its support (which solutions are a priori possible), and inductive\nbiases (which solutions are a priori likely). Ideally, we want a model to be able to represent many\npossible solutions to a given problem, with inductive biases which can extract intricate structure\nfrom limited data. For example, if we are performing character recognition, we would want our\nsupport to contain a large collection of potential characters, accounting even for rare writing styles,\nand our inductive biases to reasonably re\ufb02ect the probability of encountering each character [10].\nThe support and inductive biases of a wide range of probabilistic models, and thus the ability for\nthese models to learn and generalise, is implicitly controlled by a covariance kernel, which deter-\nmines the similarities between pairs of datapoints. For example, Bayesian basis function regression\n(including, e.g., all polynomial models), splines, and in\ufb01nite neural networks, can all exactly be rep-\nresented as a Gaussian process with a particular kernel function [11, 10, 12]. 
Moreover, the Fisher\nkernel provides a mechanism to reformulate probabilistic generative models as kernel methods [13].\nIn this paper, we wish to reverse engineer human-like support and inductive biases for function\nlearning, using a Gaussian process (GP) based kernel learning formalism. In particular:\n\n\u2022 We create new human function learning datasets, including novel function extrapolation\nproblems and multiple-choice questions that explore human intuitions about simplicity and\nexplanatory power, available at http://functionlearning.com/.\n\u2022 We develop a statistical framework for kernel learning from the predictions of a model,\nconditioned on the (training) information that model is given. The ability to sample multiple\nsets of posterior predictions from a model, at any input locations of our choice, given any\ndataset of our choice, provides unprecedented statistical strength for kernel learning. By\ncontrast, standard kernel learning involves \ufb01tting a kernel to a \ufb01xed dataset that can only be\nviewed as a single realisation from a stochastic process. Our framework leverages spectral\nmixture kernels [14] and non-parametric estimates.\n\n1\n\n\f\u2022 We exploit this framework to directly learn kernels from human responses, which contrasts\nwith all prior work on human function learning, where one compares a \ufb01xed model to hu-\nman responses. Further, we consider individual rather than averaged human extrapolations.\n\u2022 We interpret the learned kernels to gain scienti\ufb01c insights into human inductive biases,\nincluding the ability to adapt to new information for function learning. 
We also use the learned "human kernels" to inspire new types of covariance functions which can enable extrapolation on problems which are difficult for conventional GP models.
• We study Occam's razor in human function learning, and compare to GP marginal likelihood based model selection, which we show is biased towards under-fitting.
• We provide an expressive quantitative means to compare existing machine learning algorithms with human learning, and a mechanism to directly infer human prior representations.

Our work is intended as a preliminary step towards building probabilistic kernel machines that encapsulate human-like support and inductive biases. Since state of the art machine learning methods perform conspicuously poorly on a number of extrapolation problems which would be easy for humans [12], such efforts have the potential to help automate machine learning and improve performance on a wide range of tasks, including settings which are difficult for humans to process (e.g., big data and high dimensional problems). Finally, the presented framework can be considered in a more general context, where one wishes to efficiently reverse engineer interpretable properties of any model (e.g., a deep neural network) from its predictions.
We further describe related work in section 2. In section 3 we introduce a framework for learning kernels from human responses, and employ this framework in section 4. In the supplement, we provide background on Gaussian processes [11], which we recommend as a review.

2 Related Work
Historically, efforts to understand human function learning have focused on rule-based relationships (e.g., polynomial or power-law functions) [15, 16], or interpolation based on similarity learning [17, 18]. Griffiths et al. [19] were the first to note that a Gaussian process framework can be used to unify these two perspectives. 
They introduced a GP model with a mixture of RBF and polynomial kernels to reflect the human ability to learn arbitrary smooth functions while still identifying simple parametric functions. They applied this model to a standard set of evaluation tasks, comparing predictions on simple functions to averaged human judgments, and interpolation performance to human error rates. Lucas et al. [20, 21] extended this model to accommodate a wider range of phenomena, and to shed light on human predictions given sparse data.
Our work complements these pioneering Gaussian process models and prior work on human function learning, but has many features that distinguish it from previous contributions: (1) rather than iteratively building models and comparing them to human predictions, based on fixed assumptions about the regularities humans can recognize, we are directly learning the properties of the human model through advanced kernel learning techniques; (2) essentially all models of function learning, including past GP models, are evaluated on averaged human responses, setting aside individual differences and erasing critical statistical structure in the data¹. By contrast, our approach uses individual responses; (3) many recent model evaluations rely on relatively small and heterogeneous sets of experimental data. The evaluation corpora used in recent reviews [22, 19] are limited to a small set of parametric forms, and more detailed analyses tend to involve only linear, quadratic and logistic functions. Other projects have collected richer data [23, 24], but we are only aware of coarse-grained, qualitative analyses using these data. Moreover, experiments that depart from simple parametric functions tend to use very noisy data. Thus it is unsurprising that participants tend to revert to the prior mode that arises in almost all function learning experiments: linear functions, especially with slope 1 and intercept 0 [23, 24] (but see [25]). 
In a departure from prior work, we create original function learning problems with no simple parametric description and no noise, where it is obvious that human learners cannot resort to simple rules, and acquire the human data ourselves. We hope these novel datasets will inspire more detailed findings on function learning; (4) we learn kernels from human responses, which (i) provide insights into the biases driving human function learning and the human ability to progressively adapt to new information, and (ii) enable human-like extrapolations on problems that are difficult for conventional GP models; and (5) we investigate Occam's razor in human function learning and nonparametric model selection.

¹ For example, averaging prior draws from a Gaussian process would remove the structure necessary for kernel learning, leaving us simply with an approximation of the prior mean function.

3 The Human Kernel
The rule-based and associative theories for human function learning can be unified as part of a Gaussian process framework. Indeed, Gaussian processes contain a large array of probabilistic models, and have the non-parametric flexibility to produce infinitely many consistent (zero training error) fits to any dataset. Moreover, the support and inductive biases of a GP are encapsulated by a covariance kernel. Our goal is to learn GP covariance kernels from predictions made by humans on function learning experiments, to gain a better understanding of human learning, and to inspire new machine learning models, with improved extrapolation performance, and minimal human intervention.
3.1 Problem Setup
A (human) learner is given access to data y at training inputs X, and makes predictions y* at testing inputs X*. 
We assume the predictions y* are samples from the learner's posterior distribution over possible functions, following results showing that human inferences and judgments resemble posterior samples across a wide range of perceptual and decision-making tasks [26, 27, 28]. We assume we can obtain multiple draws of y* for a given X and y.
3.2 Kernel Learning
In standard GP applications, one has access to a single realisation of data y, and performs kernel learning by optimizing the marginal likelihood of the data with respect to covariance function hyperparameters θ (supplement). However, with only a single realisation of data we are highly constrained in our ability to learn an expressive kernel function, requiring us to make strong assumptions, such as RBF covariances, to extract useful information from the data. One can see this by simulating N datapoints from a GP with a known kernel, and then visualising the empirical estimate yy^T of the known covariance matrix K. The empirical estimate, in most cases, will look nothing like K. However, perhaps surprisingly, if we have even a small number of multiple draws from a GP, we can recover a wide array of covariance matrices K using the empirical estimator Y Y^T / M - ȳȳ^T, where Y is an N x M data matrix, for M draws, and ȳ is a vector of empirical means.
The typical goal in choosing kernels is to use training data to find one that minimizes some loss function, e.g., generalisation error, but here we want to reverse engineer the kernel of a model (whatever model human learners are tacitly using) that has been applied to training data, based on both the training data and the predictions of the model. 
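As a toy illustration of this estimator (our own numpy sketch; the simulated GP, sample sizes, and variable names are our assumptions, not the paper's code), the multiple-draw estimate recovers the true covariance while a single draw cannot:

```python
import numpy as np

def empirical_kernel(Y):
    """Empirical covariance estimate Y Y^T / M - ybar ybar^T,
    where the M columns of the N x M matrix Y are independent draws."""
    N, M = Y.shape
    ybar = Y.mean(axis=1, keepdims=True)  # N x 1 vector of empirical means
    return Y @ Y.T / M - ybar @ ybar.T

# Toy check: with many draws from a known zero-mean GP prior, the
# estimate approaches the true covariance matrix K.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2)  # RBF kernel matrix
L = np.linalg.cholesky(K + 1e-10 * np.eye(20))
Y = L @ rng.standard_normal((20, 5000))  # 5000 independent draws
K_hat = empirical_kernel(Y)
```

Replacing the 5000 columns with a single draw yields a rank-one matrix that looks nothing like K, which is the single-realisation failure mode described above.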
If we have a single sample extrapolation, y*, at test inputs X*, based on training points y, and Gaussian noise, the probability p(y* | y, k_θ) is given by the posterior predictive distribution of a Gaussian process, with f* ≡ y*. One can use this probability as a utility function for kernel learning, much like the marginal likelihood. See the supplement for details of these distributions.
Our problem setup affords unprecedented opportunities for flexible kernel learning. If we have multiple sample extrapolations from a given set of training data, y*(1), y*(2), . . . , y*(W), then the predictive conditional marginal likelihood becomes ∏_{j=1}^{W} p(y*(j) | y, k_θ). One could apply this new objective, for instance, if we were to view different human extrapolations as multiple draws from a common generative model. Clearly this assumption is not entirely correct, since different people will have different biases, but it naturally suits our purposes: we are not as interested in the differences between people as the shared inductive biases, and assuming multiple draws from a common generative model provides extraordinary statistical strength for learning these shared biases. Ultimately, we will study both the differences and similarities between the responses.
One option for kernel learning is to specify a flexible parametric form for k and then learn θ by optimizing our chosen objective functions. For this approach, we choose the recent spectral mixture kernels of Wilson and Adams [14], which can model a wide range of stationary covariances, and are intended to help automate kernel selection. However, we note that our objective function can readily be applied to other parametric forms. 
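A minimal sketch of this objective (our own illustrative code, not the authors' implementation; the RBF kernel and noise level are assumptions): each per-draw density p(y*(j) | y, k_θ) is the GP posterior predictive evaluated at that draw, so the log objective is a sum of multivariate normal log-densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf(x1, x2, ell=0.2):
    """RBF kernel matrix between two 1-d input vectors."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

def log_cond_marglik(y, X, Ystar, Xstar, kern, noise=0.05):
    """Sum over draws j of log p(y*(j) | y, k): each column of Ystar
    is one sample extrapolation at the test inputs Xstar."""
    Kxx = kern(X, X) + noise ** 2 * np.eye(len(X))
    Kxs = kern(X, Xstar)
    Kss = kern(Xstar, Xstar) + noise ** 2 * np.eye(len(Xstar))
    A = np.linalg.solve(Kxx, Kxs)   # Kxx^{-1} Kxs
    mu = A.T @ y                    # posterior predictive mean
    cov = Kss - Kxs.T @ A           # posterior predictive covariance
    mvn = multivariate_normal(mean=mu, cov=cov)
    return sum(mvn.logpdf(Ystar[:, j]) for j in range(Ystar.shape[1]))
```

One would then maximise this quantity over the kernel hyperparameters θ (e.g., of a spectral mixture kernel) by standard numerical optimisation.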
We also consider empirical non-parametric kernel estimation, since non-parametric kernel estimators can have the flexibility to converge to any positive definite kernel, and thus become appealing when we have the signal strength provided by multiple draws from a stochastic process.

Figure 1: Reconstructing a kernel used for predictions. (a) 1 posterior draw; (b) 10 posterior draws; (c) 20 posterior draws. Training data were generated with an RBF kernel (green, dash-dotted), and multiple independent posterior predictions were drawn from a GP with a spectral-mixture prediction kernel (blue, dashed). As the number of posterior draws increases, the learned spectral-mixture kernel (red, solid) converges to the prediction kernel.

4 Human Experiments
We wish to discover kernels that capture human inductive biases for learning functions and extrapolating from complex or ambiguous training data. We start by testing the consistency of our kernel learning procedure in section 4.1. In section 4.2, we study progressive function learning. Indeed, human participants will have a different representation (e.g., learned kernel) for different observed data, and examining how these representations progressively adapt with new information can shed light on our prior biases. In section 4.3, we learn human kernels to extrapolate on tasks which are difficult for Gaussian processes with standard kernels. In section 4.4, we study model selection in human function learning. All human participants were recruited using Amazon's Mechanical Turk and saw experimental materials provided at http://functionlearning.com. 
When we are considering stationary ground truth kernels, we use a spectral mixture for kernel learning; otherwise, we use a non-parametric empirical estimate.

4.1 Reconstructing Ground Truth Kernels
We use simulations with a known ground truth to test the consistency of our kernel learning procedure, and the effects of multiple posterior draws, in converging to a kernel which has been used to make predictions.
We sample 20 datapoints y from a GP with RBF kernel (the supplement describes GPs), k_RBF(x, x') = exp(-0.5 ||x - x'||^2 / ℓ^2), at random input locations. Conditioned on these data, we then sample multiple posterior draws, y*(1), . . . , y*(W), each containing 20 datapoints, from a GP with a spectral mixture kernel [14] with two components (the prediction kernel). The prediction kernel has deliberately not been trained to fit the data kernel. To reconstruct the prediction kernel, we learn the parameters θ of a randomly initialized spectral mixture kernel with five components, by optimizing the predictive conditional marginal likelihood ∏_{j=1}^{W} p(y*(j) | y, k_θ) with respect to θ.
Figure 1 compares the learned kernels for different numbers of posterior draws W against the data kernel (RBF) and the prediction kernel (spectral mixture). For a single posterior draw, the learned kernel captures the high-frequency component of the prediction kernel but fails at reconstructing the low-frequency component. Only with multiple draws does the learned kernel capture the longer-range dependencies. The fact that the learned kernel converges to the prediction kernel, which is different from the data kernel, shows the consistency of our procedure, which could be used to infer aspects of human inductive biases.

4.2 Progressive Function Learning
We asked humans to extrapolate beyond training data in two sets of 5 functions, each drawn from GPs with known kernels. 
The learners extrapolated on these problems in sequence, and thus had an opportunity to progressively learn about the underlying kernel in each set. To further test progressive function learning, we repeated the first function at the end of the experiment, for six functions in each set. We asked for extrapolation judgments because they provide more information about inductive biases than interpolation, and pose difficulties for conventional GP kernels [14, 12, 29].
The observed functions are shown in black in Figure 2, the human responses in blue, and the true extrapolation in dashed black. In the first two rows, the black functions are drawn from a GP with a rational quadratic (RQ) kernel [11] (for heavy tailed correlations); there are 20 participants.
We show the learned human kernel, the data generating kernel, the human kernel learned from a spectral mixture, and an RBF kernel trained only on the data, in Figures 2(g) and 2(h), respectively corresponding to Figures 2(a) and 2(f). Initially, both the human learners and RQ kernel show heavy tailed behaviour, and a bias for decreasing correlations with distance in the input space, but the human learners have a high degree of variance. By the time they have seen Figure 2(h), they are

Figure 2: Progressive Function Learning. Humans are shown functions in sequence and asked to make extrapolations. Observed data are in black, human predictions in blue, and true extrapolations in dashed black. (a)-(f): observed data are drawn from a rational quadratic kernel, with identical data in (a) and (f). (g): learned human and RBF kernels on (a) alone, and (h): on (f), after seeing the data in (a)-(e). The true data generating rational quadratic kernel is shown in red. 
(i)-(n): observed data are drawn from a product of spectral mixture and linear kernels, with identical data in (i) and (n). (o): the empirical estimate of the human posterior covariance matrix from all responses in (i)-(n). (p): the true posterior covariance matrix for (i)-(n).

more confident in their predictions, and more accurately able to estimate the true signal variance of the function. Visually, the extrapolations look more confident and reasonable. Although the human learners will adapt their representations (e.g., learned kernels) to the observed data, we can see in Figure 2(f) that they are still over-estimating the tails of the kernel, perhaps suggesting a strong prior bias for heavy-tailed correlations.
The learned RBF kernel, by contrast, cannot capture the heavy tailed nature of the training data (long range correlations), due to its Gaussian parametrization. Moreover, the learned RBF kernel underestimates the signal variance of the data, because it overestimates the noise variance (not shown), to explain away the heavy tailed properties of the data (its model misspecification).
In the second two rows, we consider a problem with highly complex structure, and only 10 participants. Here, the functions are drawn from a product of spectral mixture and linear kernels. As the participants see more functions, they appear to expect linear trends, and become more similar in their predictions. 
In Figures 2(o) and 2(p), we show the learned and true predictive correlation matrices using empirical estimators, which indicate similar correlation structure.

4.3 Discovering Unconventional Kernels
The experiments reported in this section follow the same general procedure described in Section 4.2. In this case, 40 human participants were asked to extrapolate from two single training sets, in counterbalanced order: a sawtooth function (Figure 3(a)), and a step function (Figure 3(b)), with training data shown as dashed black lines.

Figure 3: Learning Unconventional Kernels. (a)-(c): sawtooth function (dashed black), and three clusters of human extrapolations. (d): empirically estimated human covariance matrix for (a). (e)-(g): corresponding posterior draws for (a)-(c) from empirically estimated human covariance matrices. (h): posterior predictive draws from a GP with a spectral mixture kernel learned from the dashed black data. (i)-(j): step function (dashed black), and two clusters of human extrapolations. (k) and (l) are the empirically estimated human covariance matrices for (i) and (j), and (m) and (n) are posterior samples using these matrices. (o) and (p) are respectively spectral mixture and RBF kernel extrapolations from the data in black.

These types of functions are notoriously difficult for standard Gaussian process kernels [11], due to sharp discontinuities and non-stationary behaviour. In Figures 3(a), 3(b), 3(c), we used agglomerative clustering to process the human responses into three categories, shown in purple, green, and blue. 
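The clustering step might be sketched as follows; this is a hypothetical reconstruction on synthetic stand-in curves (the response data, the two-cluster cut, and the average-linkage choice are all our assumptions, not the authors' exact pipeline):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-in for human responses: each row is one
# participant's predicted curve at 50 test inputs. Two synthetic
# "behaviour types": roughly periodic guesses and roughly flat guesses.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
periodic = np.sin(8 * np.pi * t) + 0.1 * rng.standard_normal((12, 50))
flat = 0.1 * rng.standard_normal((8, 50))
responses = np.vstack([periodic, flat])

# Agglomerative clustering on Euclidean distances between curves,
# cut to a fixed number of clusters (2 for this toy data; the paper
# groups the sawtooth responses into three clusters).
Z = linkage(responses, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Each cluster's empirical covariance matrix can then be computed from its member curves, exactly as in the multiple-draw estimator of Section 3.2.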
The empirical covariance matrix of the \ufb01rst cluster (Figure 3(d)) shows the dependencies of\nthe sawtooth form that characterize this cluster. In Figures 3(e), 3(f), 3(g), we sample from the\nlearned human kernels, following the same colour scheme. The samples appear to replicate the hu-\nman behaviour, and the purple samples provide reasonable extrapolations. By contrast, posterior\nsamples from a GP with a spectral mixture kernel trained on the black data in this case quickly\nrevert to a prior mean, as shown in Fig 3(h). The data are suf\ufb01ciently sparse, non-differentiable, and\nnon-stationary, that the spectral mixture kernel is less inclined to produce a long range extrapolation\nthan human learners, who attempt to generalise from a very small amount of information.\nFor the step function, we clustered the human extrapolations based on response time and total vari-\nation of the predicted function. Responses that took between 50 and 200 seconds and did not vary\nby more than 3 units, shown in Figure 3(i), appeared reasonable. The other responses are shown in\nFigure 3(j). The empirical covariance matrices of both sets of predictions in Figures 3(k) and 3(l)\nshow the characteristics of the responses. While the \ufb01rst matrix exhibits a block structure indicating\nstep-functions, the second matrix shows fast changes between positive and negative dependencies\ncharacteristic for the high-frequency responses. Posterior sample extrapolations using the empirical\nhuman kernels are shown in Figures 3(m) and 3(n). In Figures 3(o) and 3(p) we show posterior\nsamples from GPs with spectral mixture and RBF kernels, trained on the black data (e.g., given the\nsame information as the human learners). The spectral mixture kernel is able to extract some struc-\nture (some horizontal and vertical movement), but is overcon\ufb01dent, and unconvincing compared to\nthe human kernel extrapolations. 
The RBF kernel is unable to learn much structure in the data.

4.4 Human Occam's Razor
If you were asked to predict the next number in the sequence 9, 15, 21, . . . , you are likely more inclined to guess 27 than 149.5. However, we can produce either answer using different hypotheses that are entirely consistent with the data. Occam's razor describes our natural tendency to favour the simplest hypothesis that fits the data, and is of foundational importance in statistical model selection. For example, MacKay [30] argues that Occam's razor is automatically embodied by the marginal likelihood in performing Bayesian inference: indeed, in our number sequence example, marginal likelihood computations show that 27 is millions of times more probable than 149.5, even if the prior odds are equal.
Occam's razor is vitally important in nonparametric models such as Gaussian processes, which have the flexibility to represent infinitely many consistent solutions to any given problem, but avoid overfitting through Bayesian inference. For example, the marginal likelihood of a Gaussian process (supplement) separates into automatically calibrated model fit and model complexity terms, sometimes referred to as automatic Occam's razor [31].

Figure 4: Bayesian Occam's Razor. (a) The marginal likelihood (evidence) vs. all possible datasets. The dashed vertical line corresponds to an example dataset ỹ. (b) Posterior mean functions of a GP with RBF kernel and too short, too large, and maximum marginal likelihood length-scales. Data are denoted by crosses.

The marginal likelihood p(y|M) is the probability that we would create dataset y if we were to randomly sample parameters from M [e.g., 31]. 
Simple models can only generate a small number\nof datasets, but because the marginal likelihood must normalise, it will generate these datasets with\nhigh probability. Complex models can generate a wide range of datasets, but each with typically low\nprobability. For a given dataset, the marginal likelihood will favour a model of more appropriate\ncomplexity. This argument is illustrated in Fig 4(a). Fig 4(b) illustrates this principle with GPs.\nHere we examine Occam\u2019s razor in human learning, and compare the Gaussian process marginal\nlikelihood ranking of functions, all consistent with the data, to human preferences. We generated a\ndataset sampled from a GP with an RBF kernel, and presented users with a subsample of 5 points,\nas well as seven possible GP function \ufb01ts, internally labelled as follows: (1) the predictive mean of\na GP after maximum marginal likelihood hyperparameter estimation; (2) the generating function;\n(3-7) the predictive means of GPs with larger to smaller length-scales (simpler to more complex\n\ufb01ts). We repeated this procedure four times, to create four datasets in total, and acquired 50 human\nrankings on each, for 200 total rankings. Each participant was shown the same unlabelled functions\nbut with different random orderings.\n\n(a)\n\n(b)\n\n(c)\n\nFigure 5: Human Occam\u2019s Razor. (a) Number of \ufb01rst place (highest ranking) votes for each function.\n(b) Average human ranking (with standard deviations) of functions compared to \ufb01rst place ranking\nde\ufb01ned by (a). (c) Average human ranking vs. average GP marginal likelihood ranking of functions.\n\u2018ML\u2019 = marginal likelihood optimum, \u2018Truth\u2019 = true extrapolation. Blue numbers are offsets to the\nlog length-scale from the ML optimum. 
Positive offsets correspond to simpler solutions.

Figure 5(a) shows the number of times each function was voted as the best fit to the data, which follows the internal (latent) ordering defined above. The maximum marginal likelihood solution receives the most (37%) first place votes. Functions 2, 3, and 4 received similar numbers (between 15% and 18%) of first place votes. The solutions which have a smaller length-scale (greater complexity) than the marginal likelihood best fit, represented by functions 5, 6, and 7, received a relatively small number of first place votes. These findings suggest that on average humans prefer overly simple explanations of the data. Moreover, participants generally agree with the GP marginal likelihood's first choice preference, even over the true generating function. However, these data also suggest that participants have a wide array of prior biases, leading to variability in first choice preferences. Furthermore, 86% (43/50) of participants responded that their first ranked choice was "likely to have generated the data" and looks "very similar" to what they imagined.
It is possible for highly probable solutions to be underrepresented in Figure 5(a): we might imagine, for example, that a particular solution is never ranked first, but always second. In Figure 5(b), we show the average rankings, with standard deviations (the standard errors are stdev/√200), compared to the first choice rankings, for each function. 
There is a general correspondence between rankings,\nsuggesting that although human distributions over functions have different modes, these distributions\nhave a similar allocation of probability mass. The standard deviations suggest that there is relatively\nmore agreement that the complex small length-scale functions (labels 5, 6, 7) are improbable, than\nabout speci\ufb01c preferences for functions 1, 2, 3, and 4.\nFinally, in Figure 5(c), we compare the average human rankings with the average GP marginal like-\nlihood rankings. There are clear trends: (1) humans agree with the GP marginal likelihood about\nthe best \ufb01t, and that empirically decreasing the length-scale below the best \ufb01t value monotonically\ndecreases a solution\u2019s probability; (2) humans penalize simple solutions less than the marginal like-\nlihood, with function 4 receiving a last (7th) place ranking from the marginal likelihood.\nDespite the observed human tendency to favour simplicity more than the GP marginal likelihood,\nGaussian process marginal likelihood optimisation is surprisingly biased towards under-\ufb01tting in\nfunction space. If we generate data from a GP with a known length-scale, the mode of the marginal\nlikelihood, on average, will over-estimate the true length-scale (Figures 1 and 2 in the supplement).\nIf we are unconstrained in estimating the GP covariance matrix, we will converge to the maximum\nlikelihood estimator, \u02c6K = (y\u2212 \u00afy)(y\u2212 \u00afy)(cid:62), which is degenerate and therefore biased. Parametrizing\na covariance matrix by a length-scale (for example, by using an RBF kernel), restricts this matrix to\na low-dimensional manifold on the full space of covariance matrices. A biased estimator will remain\nbiased when constrained to a lower dimensional manifold, as long as the manifold allows movement\nin the direction of the bias. 
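The degeneracy claim above is easy to check numerically: with a single realisation y, the unconstrained maximum likelihood covariance estimate is a rank-one outer product (a small sketch with our own toy data):

```python
import numpy as np

# One realisation y of an N-dimensional Gaussian with an RBF covariance.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 15)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.2 ** 2)
y = np.linalg.cholesky(K + 1e-10 * np.eye(15)) @ rng.standard_normal(15)

# Unconstrained maximum likelihood covariance estimate: the outer
# product (y - ybar)(y - ybar)^T. It has rank one, and is therefore a
# degenerate (biased) estimate of the full 15 x 15 matrix K.
ybar = y.mean()
K_mle = np.outer(y - ybar, y - ybar)
```

Constraining the estimate to an RBF length-scale family keeps it full rank, but, as argued in the text, the bias towards this degenerate estimator survives on the constrained manifold.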
Increasing a length-scale moves a covariance matrix towards the degeneracy of the unconstrained maximum likelihood estimator. With more data, the low-dimensional manifold becomes more constrained, and less influenced by this under-fitting bias.

5 Discussion

We have shown that (1) human learners have systematic expectations about smooth functions that deviate from the inductive biases inherent in the RBF kernels that have been used in past models of function learning; (2) it is possible to extract kernels that reproduce qualitative features of human inductive biases, including the variable sawtooth and step patterns; (3) human learners favour smoother or simpler functions, even in comparison to GP models that tend to over-penalize complexity; and (4) it is possible to build models that extrapolate in human-like ways which go beyond traditional stationary and polynomial kernels.
We have focused on human extrapolation from noise-free nonparametric relationships. This approach complements past work emphasizing simple parametric functions and the role of noise [e.g., 24], but kernel learning might also be applied in these other settings. In particular, iterated learning (IL) experiments [23] provide a way to draw samples that reflect human learners' a priori expectations. Like most function learning experiments, past IL experiments have presented learners with sequential data. Our approach, following Little and Shiffrin [24], instead presents learners with plots of functions. This method is useful in reducing the effects of memory limitations and other sources of noise (e.g., in perception). It is possible that people show different inductive biases across these two presentation modes.
Future work, using multiple presentation formats with the same underlying relationships, will help resolve these questions.
Finally, the ideas discussed in this paper could be applied more generally, to discover interpretable properties of unknown models from their predictions. Here one encounters fascinating questions at the intersection of active learning, experimental design, and information theory.

References

[1] W.S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5(4):115–133, 1943.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] K. Doya, S. Ishii, A. Pouget, and R.P.N. Rao. Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press, 2007.
[4] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, 2015.
[5] Daniel M Wolpert, Zoubin Ghahramani, and Michael I Jordan. An internal model for sensorimotor integration. Science, 269(5232):1880–1882, 1995.
[6] David C Knill and Whitman Richards. Perception as Bayesian Inference. Cambridge University Press, 1996.
[7] Sophie Deneve. Bayesian spiking neurons I: Inference. Neural Computation, 20(1):91–117, 2008.
[8] Thomas L Griffiths and Joshua B Tenenbaum. Optimal predictions in everyday cognition. Psychological Science, 17(9):767–773, 2006.
[9] J.B. Tenenbaum, C. Kemp, T.L. Griffiths, and N.D. Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.
[10] R.M. Neal. Bayesian Learning for Neural Networks. Springer Verlag, 1996. ISBN 0387947248.
[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[12] Andrew Gordon Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge, 2014. http://www.cs.cmu.edu/~andrewgw/andrewgwthesis.pdf.
[13] Tommi Jaakkola, David Haussler, et al. Exploiting generative models in discriminative classifiers. Advances in Neural Information Processing Systems, pages 487–493, 1998.
[14] Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. International Conference on Machine Learning (ICML), 2013.
[15] J Douglas Carroll. Functional learning: The learning of continuous functional mappings relating stimulus and response continua. ETS Research Bulletin Series, 1963(2), 1963.
[16] Kyunghee Koh and David E Meyer. Function learning: Induction of continuous stimulus-response relations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17(5):811, 1991.
[17] Edward L DeLosh, Jerome R Busemeyer, and Mark A McDaniel. Extrapolation: The sine qua non for abstraction in function learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(4):968, 1997.
[18] Jerome R Busemeyer, Eunhee Byun, Edward L DeLosh, and Mark A McDaniel. Learning functional relations based on experience with input-output pairs by humans and artificial neural networks. Concepts and Categories, 1997.
[19] Thomas L Griffiths, Chris Lucas, Joseph Williams, and Michael L Kalish. Modeling human function learning with Gaussian processes. In Neural Information Processing Systems, 2009.
[20] Christopher G Lucas, Thomas L Griffiths, Joseph J Williams, and Michael L Kalish. A rational model of function learning. Psychonomic Bulletin & Review, pages 1–23, 2015.
[21] Christopher G Lucas, Douglas Sterling, and Charles Kemp. Superspace extrapolation reveals inductive biases in function learning. In Cognitive Science Society, 2012.
[22] Mark A McDaniel and Jerome R Busemeyer. The conceptual basis of function learning and extrapolation: Comparison of rule-based and associative-based models. Psychonomic Bulletin & Review, 12(1):24–42, 2005.
[23] Michael L Kalish, Thomas L Griffiths, and Stephan Lewandowsky. Iterated learning: Intergenerational knowledge transmission reveals inductive biases. Psychonomic Bulletin & Review, 14(2):288–294, 2007.
[24] Daniel R Little and Richard M Shiffrin. Simplicity bias in the estimation of causal functions. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 1157–1162, 2009.
[25] Samuel GB Johnson, Andy Jin, and Frank C Keil. Simplicity and goodness-of-fit in explanation: The case of intuitive curve-fitting. In Proceedings of the 36th Annual Conference of the Cognitive Science Society, pages 701–706, 2014.
[26] Samuel J Gershman, Edward Vul, and Joshua B Tenenbaum. Multistability and perceptual inference. Neural Computation, 24(1):1–24, 2012.
[27] Thomas L Griffiths, Edward Vul, and Adam N Sanborn. Bridging levels of analysis for probabilistic models of cognition. Current Directions in Psychological Science, 21(4):263–268, 2012.
[28] Edward Vul, Noah Goodman, Thomas L Griffiths, and Joshua B Tenenbaum. One and done? Optimal decisions from very few samples. Cognitive Science, 38(4):599–637, 2014.
[29] Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P. Cunningham. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, 2014.
[30] David JC MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[31] Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Neural Information Processing Systems (NIPS), 2001.
[32] Andrew Gordon Wilson. A process over all stationary kernels. Technical report, University of Cambridge, 2012.