{"title": "Probing the Compositionality of Intuitive Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 3729, "page_last": 3737, "abstract": "How do people learn about complex functional structure? Taking inspiration from other areas of cognitive science, we propose that this is accomplished by harnessing compositionality: complex structure is decomposed into simpler building blocks. We formalize this idea within the framework of Bayesian regression using a grammar over Gaussian process kernels. We show that participants prefer compositional over non-compositional function extrapolations, that samples from the human prior over functions are best described by a compositional model, and that people perceive compositional functions as more predictable than their non-compositional but otherwise similar counterparts. We argue that the compositional nature of intuitive functions is consistent with broad principles of human cognition.", "full_text": "Probing the Compositionality of Intuitive Functions\n\nEric Schulz\n\nUniversity College London\n\ne.schulz@cs.ucl.ac.uk\n\nJoshua B. Tenenbaum\n\nMIT\n\njbt@mit.edu\n\nDavid Duvenaud\n\nUniversity of Toronto\n\nduvenaud@cs.toronto.edu\n\nMaarten Speekenbrink\nUniversity College London\nm.speekenbrink@ucl.ac.uk\n\nSamuel J. Gershman\nHarvard University\n\ngershman@fas.harvard.edu\n\nAbstract\n\nHow do people learn about complex functional structure? Taking inspiration from\nother areas of cognitive science, we propose that this is accomplished by harnessing\ncompositionality: complex structure is decomposed into simpler building blocks.\nWe formalize this idea within the framework of Bayesian regression using a gram-\nmar over Gaussian process kernels. 
We show that participants prefer compositional\nover non-compositional function extrapolations, that samples from the human prior\nover functions are best described by a compositional model, and that people per-\nceive compositional functions as more predictable than their non-compositional but\notherwise similar counterparts. We argue that the compositional nature of intuitive\nfunctions is consistent with broad principles of human cognition.\n\n1\n\nIntroduction\n\nFunction learning underlies many intuitive judgments, such as the perception of time, space and\nnumber. All of these tasks require the construction of mental representations that map inputs to\noutputs. Since the space of such mappings is in\ufb01nite, inductive biases are necessary to constrain the\nplausible inferences. What is the nature of human inductive biases over functions?\nIt has been suggested that Gaussian processes (GPs) provide a good characterization of these inductive\nbiases [15]. As we describe more formally below, GPs are distributions over functions that can encode\nproperties such as smoothness, linearity, periodicity, and other inductive biases indicated by research\non human function learning [5, 3]. Lucas et al. [15] showed how Bayesian inference with GP priors\ncan unify previous rule-based and exemplar-based theories of function learning [18].\nA major unresolved question is how people deal with complex functions that are not easily captured\nby any simple GP. Insight into this question is provided by the observation that many complex\nfunctions encountered in the real world can be broken down into compositions of simpler functions\n[6, 11]. We pursue this idea theoretically and experimentally, by \ufb01rst de\ufb01ning a hypothetical\ncompositional grammar for intuitive functions (based on [6]) and then investigating whether this\ngrammar quantitatively predicts human function learning performance. 
We compare the compositional\nmodel to a \ufb02exible non-compositional model (the spectral mixture representation proposed by [21]).\nBoth models use Bayesian inference to reason about functions, but differ in their inductive biases.\nWe show that (a) participants prefer compositional pattern extrapolations in both forced choice\nand manual drawing tasks; (b) samples elicited from participants\u2019 priors over functions are more\nconsistent with the compositional grammar; and (c) participants perceive compositional functions as\nmore predictable than non-compositional ones. Taken together, these \ufb01ndings provide support for the\ncompositional nature of intuitive functions.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f2 Gaussian process regression as a theory of intuitive function learning\n\nA GP is a collection of random variables, any \ufb01nite subset of which are jointly Gaussian-distributed\n(see [18] for an introduction). A GP can be expressed as a distribution over functions: f \u223c GP(m, k),\nwhere m(x) = E[f (x)] is a mean function modeling the expected output of the function given input\nx, and k(x, x(cid:48)) = E [(f (x) \u2212 m(x))(f (x(cid:48)) \u2212 m(x(cid:48)))] is a kernel function modeling the covariance\nbetween points. Intuitively, the kernel encodes an inductive bias about the expected smoothness of\nfunctions drawn from the GP. To simplify exposition, we follow standard convention in assuming a\nconstant mean of 0.\nConditional on data D = {X, y}, where yn \u223c N (f (xn), \u03c32), the posterior predictive distribution\nfor a new input x\u2217 is Gaussian with mean and variance given by:\n\n(cid:63) (K + \u03c32I)\u22121y\n\nE[f (x(cid:63))|D] = k(cid:62)\nV[f (x(cid:63))|D] = k(x(cid:63), x(cid:63)) \u2212 k(cid:62)\n\n(1)\n(2)\nwhere K is the N \u00d7 N matrix of covariances evaluated at each input in X and k(cid:63) =\n[k(x1, x\u2217), . . . 
, k(xN , x\u2217)].\nAs pointed out by Grif\ufb01ths et al. [10] (see also [15]), the predictive distribution can be viewed as\nan exemplar (similarity-based) model of function learning [5, 16], since it can be written as a linear\ncombination of the covariance between past and current inputs:\n\n(cid:63) (K + \u03c32I)\u22121k(cid:63),\n\nN(cid:88)\n\n\u221e(cid:88)\n\n(3)\nwith \u03b1 = (K + \u03c32I)\u22121y. Equivalently, by Mercer\u2019s theorem any positive de\ufb01nite kernel can be\nexpressed as an outer product of feature vectors:\n\n\u03b1nk(xn, x(cid:63))\n\nf (x\u2217) =\n\nn=1\n\nk(x, x(cid:48)) =\n\n\u03bbd\u03c6d(x)\u03c6d(x(cid:48)),\n\n(4)\nwhere {\u03c6d(x)} are the eigenfunctions of the kernel and {\u03bbd} are the eigenvalues. The posterior\npredictive mean is a linear combination of the features, which from a psychological perspective can\nbe thought of as encoding \u201crules\u201d mapping inputs to outputs [4, 14]. Thus, a GP can be expressed\nas both an exemplar (similarity-based) model and a feature (rule-based) model, unifying the two\ndominant classes of function learning theories in cognitive science [15].\n\nd=1\n\n3 Structure learning with Gaussian processes\n\nSo far we have assumed a \ufb01xed kernel function. However, humans can adapt to a wide variety of\nstructural forms [13, 8], suggesting that they have the \ufb02exibility to learn the kernel function from\nexperience. The key question addressed in this paper is what space of kernels humans are optimizing\nover\u2014how rich is their representational vocabulary? This vocabulary will in turn act as an inductive\nbias, making some functions easier to learn, and other functions harder to learn.\nBroadly speaking, there are two approaches to parameterizing the kernel space: a \ufb01xed functional\nform with continuous parameters, or a combinatorial space of functional forms. 
These approaches\nare not mutually exclusive; indeed, the success of the combinatorial approach depends on optimizing\nthe continuous parameters for each form. Nonetheless, this distinction is useful because it allows us\nto separate different forms of functional complexity. A function might have internal structure such\nthat when this structure is revealed, the apparent functional complexity is significantly reduced. For\nexample, a function composed of many piecewise linear segments might have a long description\nlength under a typical continuous parametrization (e.g., the radial basis kernel described below),\nbecause it violates the smoothness assumptions of the prior. However, conditional on the change-points between segments, the function can be decomposed into independent parts each of which is\nwell-described by a simple continuous parametrization. If internally structured functions are \u201cnatural\nkinds,\u201d then the combinatorial approach may be a good model of human intuitive functions.\nIn the rest of this section, we describe three kernel parameterizations. The first two are continuous,\ndiffering in their expressiveness. The third one is combinatorial, allowing it to capture complex\npatterns by composing simpler kernels. For all kernels, we take the standard approach of choosing\nthe parameter values that optimize the log marginal likelihood.\n\n\f3.1 Radial basis kernel\n\nThe radial basis kernel is a commonly used kernel in machine learning applications, embodying the\nassumption that the covariance between function values decays exponentially with input distance:\n\nk(x, x') = \u03b8^2 exp(-|x - x'|^2 / (2l^2)),   (5)\n\nwhere \u03b8 is a scaling parameter and l is a length-scale parameter. This kernel assumes that the same\nsmoothness properties apply globally for all inputs. 
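As an illustration, the posterior predictive equations (1)-(2) can be combined with the radial basis kernel in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation; all variable names and parameter values are our own:

```python
import numpy as np

def rbf_kernel(x1, x2, theta=1.0, length=1.0):
    """Radial basis kernel, Eq. (5): theta^2 * exp(-|x - x'|^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return theta**2 * np.exp(-d**2 / (2 * length**2))

def gp_posterior(X, y, X_star, noise=0.1):
    """Posterior predictive mean and variance, Eqs. (1)-(2)."""
    K = rbf_kernel(X, X)
    k_star = rbf_kernel(X, X_star)
    A = np.linalg.inv(K + noise**2 * np.eye(len(X)))  # (K + sigma^2 I)^{-1}
    mean = k_star.T @ A @ y                           # Eq. (1)
    cov = rbf_kernel(X_star, X_star) - k_star.T @ A @ k_star  # Eq. (2)
    return mean, np.diag(cov)

# Toy data: noisy inputs on [0, 7], predict at an unobserved point.
X = np.linspace(0, 7, 20)
y = np.sin(X)
mean, var = gp_posterior(X, y, np.array([3.5]))
```

In practice one would solve the linear system with a Cholesky factorization rather than forming the explicit inverse, but the structure of Eqs. (1)-(2) is easier to see this way.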
It provides a standard baseline to compare with\nmore expressive kernels.\n\n3.2 Spectral mixture kernel\n\nThe second approach is based on the fact that any stationary kernel can be expressed as an integral\nusing Bochner\u2019s theorem. Letting \u03c4 = |x - x'| \u2208 R^P, then\n\nk(\u03c4) = \u222b_{R^P} e^{2\u03c0 i s^T \u03c4} \u03c8(ds).   (6)\n\nIf \u03c8 has a density S(s), then S is the spectral density of k; S and k are thus Fourier duals [18]. This\nmeans that a spectral density fully defines the kernel and that furthermore every stationary kernel\ncan be expressed as a spectral density. Wilson & Adams [21] showed that the spectral density can be\napproximated by a mixture of Q Gaussians, such that\n\nk(\u03c4) = sum_{q=1}^{Q} w_q prod_{p=1}^{P} exp(-2\u03c0^2 \u03c4_p^2 \u03c5_q^{(p)}) cos(2\u03c0 \u03c4_p \u00b5_q^{(p)}).   (7)\n\nHere, the qth component has mean vector \u00b5_q = (\u00b5_q^{(1)}, . . . , \u00b5_q^{(P)}) and a covariance matrix\nM_q = diag(\u03c5_q^{(1)}, . . . , \u03c5_q^{(P)}). The result is a non-parametric approach to Gaussian process re-gression, in which complex kernels are approximated by mixtures of simpler ones. This approach is\nappealing when simpler kernels fail to capture functional structure. Its main drawback is that because\nstructure is captured implicitly via the spectral density, the building blocks are psychologically less\nintuitive: humans appear to have preferences for linear [12] and periodic [1] functions, which are not\nstraightforwardly encoded in the spectral mixture (though of course the mixture can approximate\nthese functions). 
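The one-dimensional (P = 1) case of Eq. (7) reduces to a weighted sum of Gaussian-windowed cosines. A sketch, with arbitrary illustrative component parameters (not values used in the paper):

```python
import numpy as np

def spectral_mixture_kernel(tau, weights, means, scales):
    """1-D spectral mixture kernel, Eq. (7) with P = 1.

    weights: w_q; means: mu_q (frequencies); scales: upsilon_q (bandwidths)."""
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, scales):
        # Each component: Gaussian decay in tau times a cosine at frequency mu.
        k += w * np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
    return k

# Two components: one slowly varying (mu = 0), one oscillatory (mu = 2).
k = spectral_mixture_kernel([0.0, 0.5, 1.0],
                            weights=[1.0, 0.5], means=[0.0, 2.0],
                            scales=[0.05, 0.01])
```

A component with mean frequency near zero behaves like a radial basis kernel, while a component with a sharp nonzero frequency behaves like a (damped) periodic kernel, which is why the mixture can approximate those structures without representing them explicitly.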
Since the spectral kernel has been successfully applied to reverse engineer human\nkernels [22], it is a useful reference point for comparison with more structured compositional approaches.\n\n3.3 Compositional kernel\n\nAs positive semidefinite kernels are closed under addition and multiplication, we can create richly\nstructured and interpretable kernels from well-understood base components. For example, by\nsumming kernels, we can model the data as a superposition of independent functions. Figure 1\nshows an example of how different kernels (radial basis, linear, periodic) can be combined. Table 1\nsummarizes the kernels used in our grammar.\n\nFigure 1: Examples of base and compositional kernels (panels: RBF, LIN, PER, PER + LIN, RBF \u00d7 PER).\n\nMany other compositional grammars are possible. For example, we could have included a more\ndiverse set of kernels, and other composition operators (e.g., convolution, scaling) that generate valid\nkernels. However, we believe that our simple grammar is a useful starting point, since the components\nare intuitive and likely to be psychologically plausible. For tractability, we fix the maximum number\nof combined kernels to 3. Additionally, we do not allow for repetition of kernels in order to restrict\nthe complexity of the kernel space.\n\nTable 1: Utilized base kernels in our compositional grammar; \u03c4 = |x - x'|.\nLinear: k(x, x') = (x - \u03b8_1)(x' - \u03b8_1)\nRadial basis function: k(\u03c4) = \u03b8_2^2 exp(-\u03c4^2 / (2\u03b8_3^2))\nPeriodic: k(\u03c4) = \u03b8_4^2 exp(-2 sin^2(\u03c0\u03c4\u03b8_5) / \u03b8_6^2)\n\n\f4 Experiment 1: Extrapolation\n\nThe first experiment assessed whether people prefer compositional over non-compositional extrapolations. 
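The closure properties underlying the Section 3.3 grammar can be sketched directly: summing or multiplying Gram matrices of the Table 1 base kernels yields valid covariance matrices. This is an illustrative sketch with arbitrary parameter values, not the paper's fitting code:

```python
import numpy as np

# Base kernels from Table 1 (theta values below are arbitrary illustrative choices).
def lin(x1, x2, theta1=0.0):
    return np.outer(x1 - theta1, x2 - theta1)

def rbf(x1, x2, theta2=1.0, theta3=1.0):
    tau = x1[:, None] - x2[None, :]
    return theta2**2 * np.exp(-tau**2 / (2 * theta3**2))

def per(x1, x2, theta4=1.0, theta5=1.0, theta6=1.0):
    tau = x1[:, None] - x2[None, :]
    return theta4**2 * np.exp(-2 * np.sin(np.pi * tau * theta5)**2 / theta6**2)

# Closure under + and x: compose LIN + PER and RBF x PER as in Figure 1.
x = np.linspace(0, 1, 50)
K_sum = lin(x, x) + per(x, x)
K_prod = rbf(x, x) * per(x, x)

# Both composites remain symmetric positive semidefinite covariance matrices.
min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
```

Element-wise multiplication of Gram matrices corresponds to the Schur product, which preserves positive semidefiniteness; this is what licenses kernels such as RBF x PER in the grammar.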
In experiment 1a, functions were sampled from a compositional GP and different extrapolations\n(mean predictions) were produced using each of the aforementioned kernels. Participants were then\nasked to choose among the 3 different extrapolations for a given function (see Figure 2). In detail, the\noutputs for xlearn = [0, 0.1,\u00b7\u00b7\u00b7 , 7] were used as a training set to which all three kernels were \ufb01tted\nand then used to generate predictions for the test set xtest = [7.1, 7.2,\u00b7\u00b7\u00b7 , 10]. Their mean predictions\nwere then used to generate one plot for every approach that showed the learned input as a blue line and\nthe extrapolation as a red line. The procedure was repeated for 20 different compositional functions.\n\nFigure 2: Screen shot of \ufb01rst choice experiment. Predictions in this example (from left to right) were\ngenerated by a spectral mixture, a radial basis, and a compositional kernel.\n\n52 participants (mean age=36.15, SD = 9.11) were recruited via Amazon Mechanical Turk and\nreceived $0.5 for their participation. Participants were asked to select one of 3 extrapolations\n(displayed as red lines) they thought best completed a given blue line. Results showed that participants\nchose compositional predictions 69%, spectral mixture predictions 17%, and radial basis predictions\n14% of the time. Overall, the compositional predictions were chosen signi\ufb01cantly more often than\nthe other two (\u03c72 = 591.2, p < 0.01) as shown in Figure 3a.\n\n(a) Choice proportion for compositional ground truth. (b) Choice proportion for spectral mixture ground truth.\nFigure 3: Results of extrapolation experiments. Error bars represent the standard error of the mean.\n\nIn experiment 1b, again 20 functions were sampled but this time from a spectral mixture kernel\nand 65 participants (mean age=30, SD = 9.84) were asked to choose among either compositional\nor spectral mixture extrapolations and received $0.5 as before. 
Results (displayed in Figure 3b)\nshowed that participants again chose compositional extrapolations more frequently (68% vs. 32%,\n\u03c7^2 = 172.8, p < 0.01), even if the ground truth happened to be generated by a spectral mixture\nkernel. Thus, people seem to prefer compositional over non-compositional extrapolations in forced\nchoice extrapolation tasks.\n\n\f5 Markov chain Monte Carlo with people\n\nIn a second set of experiments, we assessed participants\u2019 inductive biases directly using a Markov\nchain Monte Carlo with People (MCMCP) approach [19]. Participants accept or reject proposed\nextrapolations, effectively simulating a Markov chain whose stationary distribution is in this case the\nposterior predictive. Extrapolations from all possible kernel combinations (up to 3 combined kernels)\nwere generated and stored a priori. These were then used to generate plots of different proposal\nextrapolations (as in the previous experiment). On each trial, participants chose between their most\nrecently accepted extrapolation and a new proposal.\n\n5.1 Experiment 2a: Compositional ground truth\n\nIn the first MCMCP experiment, we sampled functions from compositional kernels. Eight different\nfunctions were sampled from various compositional kernels, the input space was split into training\nand test sets, and then all kernel combinations were used to generate extrapolations. Proposals were\nsampled uniformly from this set. 51 participants with an average age of 32.55 (SD = 8.21) were\nrecruited via Amazon\u2019s Mechanical Turk and paid $1. There were 8 blocks of 30 trials, where each\nblock corresponded to a single training set. 
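The MCMCP dynamics can be sketched as a simple chain over a discrete hypothesis set. The Luce/Barker-style acceptance rule below is the standard idealization from the MCMCP literature [19], not a detail stated in this paper, and the hypothesis probabilities are invented for illustration:

```python
import random

def mcmcp_chain(posterior, n_trials=20000, seed=0):
    """Simulate an idealized MCMCP participant over a discrete hypothesis set.

    posterior: dict mapping hypothesis -> (unnormalized) subjective probability.
    Under the Barker choice rule, the chain's stationary distribution is the
    normalized posterior, so visit counts estimate subjective beliefs."""
    rng = random.Random(seed)
    options = list(posterior)
    current = rng.choice(options)
    counts = {h: 0 for h in options}
    for _ in range(n_trials):
        proposal = rng.choice(options)  # proposals drawn uniformly, as in the experiment
        p_accept = posterior[proposal] / (posterior[proposal] + posterior[current])
        if rng.random() < p_accept:
            current = proposal
        counts[current] += 1
    return counts

counts = mcmcp_chain({"LIN+PER": 6.0, "PER": 3.0, "RBF": 1.0})
```

With these illustrative weights, the chain visits "LIN+PER" most often, mirroring how accepted-kernel proportions were read off participants' chains in Figure 4.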
We calculated the average proportion of accepted kernels\nover the last 5 trials, as shown in Figure 4.\n\nFigure 4: Proportions of chosen predictions over last 5 trials. Generating kernel marked in red.\n(Panels: LIN + PER 1\u20134, LIN \u00d7 PER, PER \u00d7 RBF + LIN, PER, LIN + PER + RBF.)\n\nIn all cases participants\u2019 subjective probability distribution over kernels corresponded well with the\ndata-generating kernels. Moreover, the inverse marginal likelihood, standardized over all kernels,\ncorrelated highly with the subjective beliefs assessed by MCMCP (\u03c1 = 0.91, p < .01). Thus, partici-pants seemed to converge to sensible structures when the functions were generated by compositional\nkernels.\n\n\f5.2 Experiment 2b: Naturalistic functions\n\nThe second MCMCP experiment assessed what structures people converged to when faced with real\nworld data. 51 participants with an average age of 32.55 (SD = 12.14) were recruited via Amazon\nMechanical Turk and received $1 for their participation. The functions were an airline passenger data\nset, volcano CO2 emission data, the number of gym memberships over 5 years, and the number of\ntimes people googled the band \u201cWham!\u201d over the last 8 years; all shown in Figure 5a. 
Participants\nwere not told any information about the data set (including input and output descriptions) beyond\nthe input-output pairs. As periodicity in the real world is rarely ever purely periodic, we adapted\nthe periodic component of the grammar by multiplying a periodic kernel with a radial basis kernel,\nthereby locally smoothing the periodic part of the function.1 Apart from the different training sets,\nthe procedure was identical to the last experiment.\n\n(a) Data.\n\n(b) Proportions of chosen predictions over last 5 trials.\n\nFigure 5: Real world data and MCMCP results. Error bars represent the standard error of the mean.\n\nResults are shown in Figure 5b, demonstrating that participants converged to intuitively plausible\npatterns. In particular, for both the volcano and the airline passenger data, participants converged to\ncompositions resembling those found in previous analyses [6]. The correlation between the mean\nproportion of accepted predictions and the inverse standardized marginal likelihoods of the different\nkernels was again signi\ufb01cantly positive (\u03c1 = 0.83, p < .01).\n\n6 Experiment 3: Manual function completion\n\nIn the next experiment, we let participants draw the functions underlying observed data manually.\nAs all of the prior experiments asked participants to judge between \u201cpre-generated\u201d predictions of\nfunctions, we wanted to compare this to how participants generate predictions themselves. On each\nround of the experiment, functions were sampled from the compositional grammar, the number of\npoints to be presented on each trial was sampled uniformly between 100 and 200, and the noise\nvariance was sampled uniformly between 0 and 25. 
Finally, the size of an unobserved region of the function was sampled to lie between 5 and 50.\nParticipants were asked to manually draw the function best describing observed data and to inter- and extrapolate this function in two unobserved regions. A\nscreen shot of the experiment is shown in Figure 6.\n\n1See the following page for an example: http://learning.eng.cam.ac.uk/carl/mauna.\n\nFigure 6: Manual pattern completion experiment. Extrapolation region is delimited by vertical lines.\n\n36 participants with a mean age of 30.5 (SD = 7.15) were recruited from Amazon Mechanical Turk\nand received $2 for their participation. Participants were asked to draw lines in a cloud of dots that\nthey thought best described the given data. To facilitate this process, participants placed black dots\ninto the cloud, which were then automatically connected by a black line based on a cubic Bezier\nsmoothing curve. They were asked to place the first dot on the left boundary and the final dot on the\nright boundary of the graph. In between, participants were allowed to place as many dots as they\nliked (from left to right) and could remove previously placed dots. There were 50 trials in total. We\nassessed the average root mean squared distance between participants\u2019 predictions (the line they drew)\nand the mean predictions of each kernel given the data participants had seen, for both interpolation\nand extrapolation areas. 
Results are shown in Figure 7.\n\n(a) Distance for interpolation drawings.\n\n(b) Distance for extrapolation drawings.\n\nFigure 7: Root mean squared distances. Error bars represent the standard error of the mean.\n\nThe mean distance from participants\u2019 drawings was signi\ufb01cantly higher for the spectral mixture\nkernel than for the compositional kernel in both interpolation (86.96 vs. 58.33, t(1291.1) = \u22126.3,\np < .001) and extrapolation areas (110.45 vs 83.91, t(1475.7) = 6.39, p < 0.001). The radial basis\nkernel produced similar distances as the compositional kernel in interpolation (55.8), but predicted\nparticipants\u2019 drawings signi\ufb01cantly worse in extrapolation areas (97.9, t(1459.9) = 3.26, p < 0.01).\n\n7 Experiment 4: Assessing predictability\n\nCompositional patterns might also affect the way in which participants perceive functions a priori\n[20]. To assess this, we asked participants to judge how well they thought they could predict 40\ndifferent functions that were similar on many measures such as their spectral entropy and their average\nwavelet distance to each other, but 20 of which were sampled from a compositional and 20 from a\nspectral mixture kernel. Figure 8 shows a screenshot of the experiment.\n50 participants with a mean age of 32 (SD = 7.82) were recruited via Amazon Mechanical Turk and\nreceived $0.5 for their participation. Participants were asked to rate the predictability of different\nfunctions. On each trial participants were shown a total of nj \u2208 {50, 60, . . . , 100} randomly sampled\ninput-output points of a given function and asked to judge how well they thought they could predict the\noutput for a randomly sampled input point on a scale of 0 (not at all) to 100 (very well). 
Afterwards,\nthey had to rate which of two functions was easier to predict (Figure 8) on a scale from -100 (left\ngraph is definitely easier to predict) to 100 (right graph is definitely easier to predict).\n\n(a) Predictability judgements. (b) Comparative judgements.\nFigure 8: Screenshot of the predictability experiment.\n\n(a) Predictability judgements. (b) Comparative judgements.\nFigure 9: Results of the predictability experiment. Error bars represent the standard error of the mean.\n\nAs shown in Figure 9, compositional functions were perceived as more predictable than spectral\nfunctions in isolation (t(948) = 11.422, p < 0.01) and in paired comparisons (t(499) = 13.502,\np < 0.01). Perceived predictability increases with the number of observed outputs (r = 0.23,\np < 0.01) and the larger the number of observations, the larger the difference between compositional\nand spectral mixture functions (r = 0.14, p < 0.01).\n\n\f8 Discussion\n\nIn this paper, we probed human intuitions about functions and found that these intuitions are best\ndescribed as compositional. We operationalized compositionality using a grammar over kernels within\na GP regression framework and found that people prefer extrapolations based on compositional kernels\nover other alternatives, such as a spectral mixture or the standard radial basis kernel. Two Markov\nchain Monte Carlo with people experiments revealed that participants converge to extrapolations\nconsistent with the compositional kernels. These findings were replicated when people manually drew\nthe functions underlying observed data. Moreover, participants perceived compositional functions as\nmore predictable than non-compositional (but otherwise similar) ones.\nThe work presented here is connected to several lines of previous research, most importantly that\nof Lucas et al. 
[15], which introduced GP regression as a model of human function learning, and\nWilson et al. [22], which attempted to reverse-engineer the human kernel using a spectral mixture.\nWe see our work as complementary; we need both a theory to describe how people make sense of\nstructure as well as a method to indicate what the final structure might look like when represented\nas a kernel. Our approach also ties together neatly with past attempts to model structure in other\ncognitive domains such as motion perception [9] and decision making [7].\nOur work can be extended in a number of ways. First, it is desirable to more thoroughly explore\nthe space of base kernels and composition operators, since we used an elementary grammar in\nour analyses that is probably too simple. Second, the compositional approach could be used in\ntraditional function learning paradigms (e.g., [5, 14]) as well as in active input selection paradigms\n[17]. Another interesting avenue for future research would be to explore the broader implications of\ncompositional function representations. For example, evidence suggests that statistical regularities\nreduce perceived numerosity [23] and increase memory capacity [2]; these tasks can therefore provide\nclues about the underlying representations. If compositional functions alter number perception or\nmemory performance to a greater extent than alternative functions, that suggests that our theory\nextends beyond simple function learning.\n\n\fReferences\n[1] L. Bott and E. Heit. Nonmonotonic extrapolation in function learning. Journal of Experimental Psychology:\n\nLearning, Memory, and Cognition, 30:38\u201350, 2004.\n\n[2] T. F. Brady, T. Konkle, and G. A. Alvarez. 
A review of visual memory capacity: Beyond individual items\n\nand toward structured representations. Journal of Vision, 11:4\u20134, 2011.\n\n[3] B. Brehmer. Hypotheses about relations between scaled variables in the learning of probabilistic inference\n\ntasks. Organizational Behavior and Human Performance, 11(1):1\u201327, 1974.\n\n[4] J. D. Carroll. Functional learning: The learning of continuous functional mappings relating stimulus and\n\nresponse continua. Educational Testing Service, 1963.\n\n[5] E. L. DeLosh, J. R. Busemeyer, and M. A. McDaniel. Extrapolation: The sine qua non for abstraction in\nfunction learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(4):968,\n1997.\n\n[6] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure discovery in\nnonparametric regression through compositional kernel search. Proceedings of the 30th International\nConference on Machine Learning, pages 1166\u20131174, 2013.\n\n[7] S. J. Gershman, J. Malmaud, J. B. Tenenbaum, and S. Gershman. Structured representations of utility in\n\ncombinatorial domains. Decision, 2016.\n\n[8] S. J. Gershman and Y. Niv. Learning latent structure: carving nature at its joints. Current Opinion in\n\nNeurobiology, 20:251\u2013256, 2010.\n\n[9] S. J. Gershman, J. B. Tenenbaum, and F. J\u00e4kel. Discovering hierarchical motion structure. Vision Research,\n\n2016.\n\n[10] T. L. Grif\ufb01ths, C. Lucas, J. Williams, and M. L. Kalish. Modeling human function learning with gaussian\n\nprocesses. In Advances in Neural Information Processing Systems, pages 553\u2013560, 2009.\n\n[11] R. Grosse, R. R. Salakhutdinov, W. T. Freeman, and J. B. Tenenbaum. Exploiting compositionality to\n\nexplore a large space of model structures. Uncertainty in Arti\ufb01cial Intelligence, 2012.\n\n[12] M. L. Kalish, T. L. Grif\ufb01ths, and S. Lewandowsky.\n\nIterated learning: Intergenerational knowledge\n\ntransmission reveals inductive biases. 
Psychonomic Bulletin & Review, 14:288\u2013294, 2007.\n\n[13] C. Kemp and J. B. Tenenbaum. Structured statistical models of inductive reasoning. Psychological Review,\n\n116:20\u201358, 2009.\n\n[14] K. Koh and D. E. Meyer. Function learning: Induction of continuous stimulus-response relations. Journal\n\nof Experimental Psychology: Learning, Memory, and Cognition, 17:811\u2013836, 1991.\n\n[15] C. G. Lucas, T. L. Grif\ufb01ths, J. J. Williams, and M. L. Kalish. A rational model of function learning.\n\nPsychonomic bulletin & review, 22(5):1193\u20131215, 2015.\n\n[16] M. A. Mcdaniel and J. R. Busemeyer. The conceptual basis of function learning and extrapolation:\nComparison of rule-based and associative-based models. Psychonomic Bulletin & Review, 12:24\u201342, 2005.\n\n[17] P. Parpart, E. Schulz, M. Speekenbrink, and B. C. Love. Active learning as a means to distinguish among\nprominent decision strategies. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society,\npages 1829\u20131834, 2015.\n\n[18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[19] A. N. Sanborn, T. L. Grif\ufb01ths, and R. M. Shiffrin. Uncovering mental representations with Markov chain\n\nMonte Carlo. Cognitive Psychology, 60(2):63\u2013106, 2010.\n\n[20] E. Schulz, J. B. Tenenbaum, D. N. Reshef, M. Speekenbrink, and S. J. Gershman. Assessing the perceived\npredictability of functions. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society,\npages 2116\u20132121. Cognitive Science Society, 2015.\n\n[21] A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. arXiv\n\npreprint arXiv:1302.4245, 2013.\n\n[22] A. G. Wilson, C. Dann, C. Lucas, and E. P. Xing. The human kernel. In Advances in Neural Information\n\nProcessing Systems, pages 2836\u20132844, 2015.\n\n[23] J. Zhao and R. Q. Yu. Statistical regularities reduce perceived numerosity. 
Cognition, 146:217\u2013222, 2016.\n", "award": [], "sourceid": 1854, "authors": [{"given_name": "Eric", "family_name": "Schulz", "institution": "University College London"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}, {"given_name": "David", "family_name": "Duvenaud", "institution": "University of Toronto"}, {"given_name": "Maarten", "family_name": "Speekenbrink", "institution": "University College London"}, {"given_name": "Samuel", "family_name": "Gershman", "institution": "Harvard University"}]}