{"title": "GRIFT: A graphical model for inferring visual classification features from human data", "book": "Advances in Neural Information Processing Systems", "page_first": 1217, "page_last": 1224, "abstract": "This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects' performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it models classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than has been previously possible with other methods and provides a foundation for further exploration.", "full_text": "GRIFT: A graphical model for inferring visual classification features from human data\n\nMichael G. Ross\nDepartment of Brain and Cognitive Sciences\nMassachusetts Institute of Technology\nCambridge, MA 02139\nmgross@mit.edu\n\nAndrew L. Cohen\nPsychology Department\nUniversity of Massachusetts Amherst\nAmherst, MA 01003\nacohen@psych.umass.edu\n\nAbstract\n\nThis paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects' performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. 
This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration.\n\n1 Introduction\n\nAlthough a great deal is known about the low-level features computed by the human visual system, determining the information used to make high-level visual classifications is an active area of research. When a person distinguishes between two faces, for example, what image regions are most salient? Since the early 1970s, one of the most important research tools for answering such questions has been the classification image (or reverse correlation) algorithm, which assumes a linear classification model [1]. This paper describes a new approach, GRIFT (GRaphical models for Inferring Feature Templates). Instead of representing human visual discrimination as a single linear classifier, GRIFT models it as the non-linear combination of multiple independently detected features. This allows GRIFT to extract more detailed information about human classification.\n\nThis paper describes GRIFT and the algorithms for fitting it to data, demonstrates the model's efficacy on simulated and human data, and concludes with a discussion of future research directions.\n\n2 Related work\n\nAhumada's classification image algorithm [1] models an observer's classifications of visual stimuli with a noisy linear classifier — a fixed set of weights and a normally distributed threshold. The random threshold accounts for the fact that multiple presentations of the same stimulus are often classified inconsistently. In a typical classification image experiment, participants are presented with hundreds or thousands of noise-corrupted examples from two categories and asked to classify each one. 
The noise ensures that the samples cover a large volume of the sample space in order to allow recovery of a unique linear classifier that best explains the data.\n\nAlthough classification images are useful in many cases, it is well established that there are domains in which recognition and classification are the result of combining the detection of parts or features, rather than applying a single linear template. For example, Pelli et al. [10] have convincingly demonstrated that humans recognize noisy word images by parts, even when whole-word templates would perform better. Similarly, Gold et al. [7] verified that subjects employed feature-based classification strategies for some simple artificial image classes. GRIFT takes the next step and infers features which predict human performance directly from classification data.\n\nFigure 1: Left: The GRIFT model is a Bayes net that describes classification as the result of combining N feature detectors. Right: Targets and sample stimuli from the three experiments.\n\nMost work on modeling non-linear, feature-based classification in humans has focused on verifying the use of a predefined set of features. Recent work by Cohen et al. [4] demonstrates that Gaussian mixture models can be used to recover features from human classification data without specifying a fixed set of possible features. The GRIFT model, described in the remainder of this paper, has the same goals as the previous work, but removes several limitations of the Gaussian mixture model approach, including the need to only use stimuli the subjects classified with high confidence and the bias that the signals can exert on the recovered features. GRIFT achieves these and other improvements by generatively modeling the entire classification process with a graphical model. 
Furthermore, the similarity between single-feature GRIFT models and the classification image process, described in more detail below, makes GRIFT a natural successor to the traditional approach.\n\n3 GRIFT model\n\nGRIFT models classification as the result of combining N conditionally independent feature detectors, F = {F_1, F_2, . . . , F_N}. Each feature detector is binary valued (1 indicates detection), as is the classification, C (1 indicates one class and 2 the other). The stimulus, S, is an array of continuously valued pixels representing the input image. The stimulus only influences C through the feature detectors; therefore the joint probability of a stimulus and classification pair is\n\nP(C, S) = \sum_F P(C|F) P(S) \prod_{i=1}^{N} P(F_i|S).\n\nFigure 1 represents the causal relationship between these variables (C, F, and S) with a Bayesian network. The network also includes nodes representing model parameters (ω, β, and λ), whose role will be described below. The boxed region in the figure indicates the parts of the model that are replicated when N > 1 — each feature detector is represented by an independent copy of those variables and parameters.\n\nThe distribution of the stimulus, P(S), is under the control of the experimenter. The algorithm for fitting the model to data only assumes that the stimuli are independent and identically distributed across trials. The conditional distribution of each feature detector's value, P(F_i|S), is modeled with a logistic regression function on the pixel values of S. Logistic regression is desirable because it is a probabilistic linear classifier. 
Humans can successfully classify images in the presence of extremely high additive noise, which suggests the use of averaging and contrast, linear computations which are known to play important roles in human visual perception [9]. Just as the classification image used a random threshold to represent uncertainty in the output of its single linear classifier, logistic regression also allows GRIFT to represent uncertainty in the output of each of its feature detectors. The conditional distribution of C is represented by logistic regression on the feature outputs.\n\nEach F_i's distribution has two parameters, a weight vector ω_i and a threshold β_i, such that\n\nP(F_i = 1 | S, ω_i, β_i) = (1 + \exp(β_i + \sum_{j=1}^{|S|} ω_{ij} S_j))^{-1},\n\nwhere |S| is the number of pixels in a stimulus. Similarly, the conditional distribution of C is determined by λ = {λ_0, λ_1, . . . , λ_N}, where\n\nP(C = 1 | F, λ) = (1 + \exp(λ_0 + \sum_{i=1}^{N} λ_i F_i))^{-1}.\n\nDetecting a feature with negative λ_i increases the probability that the subject will respond "class 1"; those with positive λ_i are associated with "class 2" responses.\n\nA GRIFT model with N features applied to the classification of images each containing |S| pixels has N(|S| + 2) + 1 parameters. This large number of parameters, coupled with the fact that the F variables are unobservable, makes fitting the model to data very challenging. Therefore, GRIFT defines prior distributions on its parameters. These priors reflect reasonable assumptions about the parameter values and, if they are wrong, can be overturned if enough contrary data is available. 
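The two logistic stages above are easy to state in code. The following is a minimal sketch (not the authors' implementation); the parameter values are invented for illustration, and the sign convention follows the equations in the text, where a large positive argument to exp makes the outcome less likely:

```python
import numpy as np

def p_feature_active(S, omega_i, beta_i):
    # P(F_i = 1 | S, omega_i, beta_i) = (1 + exp(beta_i + omega_i . S))^-1
    return 1.0 / (1.0 + np.exp(beta_i + omega_i @ S))

def p_class_1(F, lam):
    # P(C = 1 | F, lambda) = (1 + exp(lambda_0 + sum_i lambda_i * F_i))^-1
    return 1.0 / (1.0 + np.exp(lam[0] + lam[1:] @ F))

# Illustration with invented parameters: one 4-pixel stimulus, two detectors.
rng = np.random.default_rng(0)
S = rng.normal(size=4)                     # noisy stimulus pixels
omegas = [rng.normal(size=4), rng.normal(size=4)]
betas = [0.0, 0.5]
lam = np.array([0.0, -2.0, 2.0])           # lambda_0, lambda_1, lambda_2

pF = np.array([p_feature_active(S, w, b) for w, b in zip(omegas, betas)])
F = (rng.random(2) < pF).astype(float)     # sample binary detector outputs
print(p_class_1(F, lam))                   # probability of responding "class 1"
```

With β_i = 0 and ω_i = 0 a detector is maximally uncertain (probability exactly 0.5), which matches the random-threshold interpretation given in the text.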
The prior on each of the λ_i parameters for which i > 0 is a mixture of two normal distributions,\n\nP(λ_i) = \frac{1}{2\sqrt{2π}} (\exp(-\frac{(λ_i + 2)^2}{2}) + \exp(-\frac{(λ_i - 2)^2}{2})).\n\nThis prior reflects the assumption that each feature detector should have a significant impact on the classification, but no single detector should make it deterministic — a single-feature model with λ_0 = 0 and λ_1 = -2 has an 88% chance of choosing class 1 if the feature is active. The λ_0 parameter has an improper non-informative prior, P(λ_0) = 1, indicating no preference for any particular value [5], because the best λ_0 is largely determined by the other λ_is and the distributions of F and S. For analogous reasons, P(β_i) = 1.\n\nThe ω_i parameters, which each have dimensionality equal to the stimulus, present the biggest inferential challenge. As mentioned previously, human visual processing is sensitive to contrasts between image regions. If one image region is assigned positive ω_ijs and another is assigned negative ω_ijs, the feature detector will be sensitive to the contrast between them. This contrast between regions requires all the pixels within each region to share similar ω_ij values. To encourage this local structure, the ω_i parameters have Markov random field prior distributions:\n\nP(ω_i) ∝ (\prod_j (\exp(-\frac{(ω_{ij} + 1)^2}{2}) + \exp(-\frac{(ω_{ij} - 1)^2}{2}))) (\prod_{(j,k) ∈ A} \exp(-\frac{(ω_{ij} - ω_{ik})^2}{2})),\n\nwhere A is the set of neighboring pixel locations. The first factor encourages weight values to be near the -1 to 1 range, while the second encourages the assignment of similar weights to neighboring pixels. 
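As a concrete reading of these priors, here is a short sketch (an illustration, not the paper's code) that evaluates the unnormalized log-priors for a single detector's weight field laid out on an image grid with 4-neighbor cliques:

```python
import numpy as np

def log_prior_lambda(lam_i):
    # log of the equal mixture of N(-2, 1) and N(+2, 1), up to an additive constant
    return np.log(np.exp(-(lam_i + 2)**2 / 2) + np.exp(-(lam_i - 2)**2 / 2))

def log_prior_omega(omega, shape):
    # omega: flat weight vector for one detector, reshaped to the image grid
    W = np.asarray(omega).reshape(shape)
    # first factor: each weight should sit near -1 or +1
    point = np.log(np.exp(-(W + 1)**2 / 2) + np.exp(-(W - 1)**2 / 2)).sum()
    # second factor: neighboring weights should agree (horizontal + vertical pairs)
    smooth = -(((W[:, 1:] - W[:, :-1])**2) / 2).sum() \
             - (((W[1:, :] - W[:-1, :])**2) / 2).sum()
    return point + smooth

# A spatially constant weight field pays no smoothness penalty...
flat = np.ones(16)
# ...while a checkerboard of +/-1 weights is heavily penalized.
checker = (np.indices((4, 4)).sum(axis=0) % 2) * 2.0 - 1.0
print(log_prior_omega(flat, (4, 4)), log_prior_omega(checker.ravel(), (4, 4)))
```

This makes the qualitative effect of the prior visible: smooth, region-like weight fields score higher than pixel-to-pixel oscillations of the same magnitude.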
Fitting the model to data does not require the normalization of this distribution.\n\nThe Bayesian joint probability distribution of all the parameters and variables is\n\nP(C, F, S, ω, β, λ) = P(C|F, λ) P(S) P(λ_0) \prod_{i=1}^{N} P(F_i|S, ω_i, β_i) P(ω_i) P(β_i) P(λ_i). (1)\n\n4 GRIFT algorithm\n\nThe goal of the algorithm is to find the parameters that satisfy the prior distributions and best account for the (S, C) samples gathered from a human subject. Mathematically, this goal corresponds to finding the mode of P(ω, β, λ|S, C), where S and C refer to all of the observed samples. The algorithm is derived using the expectation-maximization (EM) method [3], a widely used optimization technique for dealing with unobserved variables, in this case F, the feature detector outputs for all the trials. In order to determine the most probable parameter assignments, the algorithm chooses random initial parameters θ* = (ω*, β*, λ*) and then finds the θ that maximizes\n\nQ(θ|θ*) = \sum_F P(F|S, C, θ*) \log P(C, F, S|θ) + \log P(θ).\n\nQ(θ|θ*) is the expected log posterior probability of the parameters computed by using the current θ* to estimate the distribution of F, the unobserved feature detector activations. The θ that maximizes Q then becomes θ* for the next iteration, and the process is repeated until convergence.\n\nThe presence of both the P(C, F, S|θ) and P(θ) terms encourages the algorithm to find parameters that explain the data and match the assumptions encoded in the parameter prior distributions. 
As the amount of available data increases, the influence of the priors decreases, so it is possible to discover features that are contrary to prior belief given enough evidence.\n\nUsing the conditional independences from the Bayes net:\n\nQ(θ|θ*) ∝ \sum_F P(F|S, C, θ*) (\log P(C|F, λ) + \sum_{i=1}^{N} \log P(F_i|S, ω_i, β_i)) + \sum_{i=1}^{N} (\log P(ω_i) + \log P(λ_i)),\n\ndropping the log P(S) term, which is independent of the parameters, and the log P(λ_0) and log P(β_i) terms, which are 0. As mentioned before, the normalization terms for the log P(ω_i) elements can be ignored during optimization — the log makes them additive constants to Q. The functional form of every additive term is described in Section 3, and P(F|S, C, θ*) can be calculated using the model's joint probability function (Equation 1).\n\nEach iteration of EM requires maximizing Q, but it is not possible to compute the maximizing θ in closed form. Fortunately, it is relatively easy to search for the best θ. Because Q is separable into many additive components, it is possible to efficiently compute its gradient with respect to each of the elements of θ and use this information to find a locally maximum θ assignment using the scaled conjugate gradient algorithm [2]. Even a locally maximum value of θ usually provides good EM results — P(ω, β, λ|S, C) is still guaranteed to improve after every iteration.\n\nThe result of any EM procedure is only guaranteed to be a locally optimal answer, and finding the globally optimal θ is made more challenging by the large number of parameters. GRIFT adopts the standard solution of running EM many times, each instance starting with a random θ*, and then accepting the θ from the run which produced the most probable parameters. 
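The E-step quantity P(F|S, C, θ*) can be computed exactly for small N by enumerating all 2^N feature configurations and applying Bayes' rule to the joint in Equation 1. The sketch below works under that assumption (it is an illustration, not the authors' MATLAB code); the parameter-prior terms drop out because they are constant across configurations:

```python
import itertools
import numpy as np

def inv_logit_neg(x):
    # the paper's logistic form: (1 + exp(x))^-1
    return 1.0 / (1.0 + np.exp(x))

def posterior_over_F(S, c, omegas, betas, lam):
    """Return all 2^N feature configurations and P(F | S, C=c, theta)."""
    N = len(betas)
    configs = [np.array(f, dtype=float)
               for f in itertools.product([0, 1], repeat=N)]
    weights = []
    for F in configs:
        # P(F | S, theta): product over conditionally independent detectors
        p_f = 1.0
        for i in range(N):
            p1 = inv_logit_neg(betas[i] + omegas[i] @ S)
            p_f *= p1 if F[i] == 1 else 1.0 - p1
        # P(C = c | F, theta)
        p1 = inv_logit_neg(lam[0] + lam[1:] @ F)
        p_c = p1 if c == 1 else 1.0 - p1
        weights.append(p_f * p_c)
    w = np.array(weights)
    return configs, w / w.sum()        # normalize: Bayes' rule
```

For the small N used in the experiments (at most five features) this enumeration costs at most 32 evaluations per trial, so the exact E-step is cheap.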
For this model and the data presented in the following sections, 20-30 random restarts were sufficient.\n\n5 Experiments\n\nThe GRIFT model was fit to data from three experiments. In each experiment, human participants classified stimuli into two classes. Each class contained one or more target stimuli. In each trial, the participant saw a stimulus (a sample from S) that consisted of a randomly chosen target with high levels of independent identically distributed noise added to each pixel. The noise samples were drawn from a truncated normal distribution to ensure that the stimulus pixel values remained within the display's output range. Figure 1 shows the classes and targets from each experiment and a sample stimulus from each class. In the four-square experiment four participants were asked to distinguish between two artificial stimulus classes, one in which there were bright squares in the upper-left or upper-right corners and one in which there were bright squares in the lower-left or lower-right corners. In the light-dark experiment three participants were asked to distinguish between three strips that each had two light blobs and three strips that each had only one light blob. Finally, in the faces experiment three participants were asked to distinguish between two faces. The four-square data were collected by [7] and were also analyzed in [4]. The other data are newly gathered. Each data set consists of approximately 4000 trials from each subject. To maintain their interest in the task, participants were given auditory feedback after each trial that indicated success or failure.\n\nFigure 2: The most probable ω parameters found for the four-square experiments for different values of N and the mutual information between these feature detectors and the observed classifications.\n\nFitting GRIFT models is not especially sensitive to the random initialization procedure used to start each EM instance. 
The λ* parameters were initialized by normal random samples and then half were negated so the features would tend to start evenly assigned to the two classes, except for λ*_0, which was initialized to 0. In the four-square experiments, the ω* parameters were initialized by a mixture of normal distributions, and in the light-dark experiments they were initialized from a uniform distribution. In the faces experiments the ω* were initialized by adding normal noise to the optimal linear classifier separating the two targets. Because of the large number of pixels in the faces stimuli, the other initialization procedures frequently produced initial assignments with extremely low probabilities, which led to numerical precision problems. In the four-square experiments, the β* were initialized randomly. In the other experiments, the intent was to set them to the optimal threshold for distinguishing the classes using the initial ω* as a linear classifier, but a programming error set them to the negation of that value. In most cases, the results were insensitive to the choice of initialization method.\n\nIn the four-square experiment, the noise levels were continually adjusted to keep the participants' performance at approximately 71% using the stair-casing algorithm [8]. This performance level is high enough to keep the participants engaged in the task, but allows for sufficient noise to explore their responses in a large volume of the stimulus space. After an initial adaptation period, the noise level remains relatively constant across trials, so the inter-trial dependence introduced by the stair-casing can be safely ignored. Two simulated observers were created to validate GRIFT on the four-square task. 
Each used a GRIFT model with pre-specified parameters to probabilistically classify four-square data at a fixed noise level, which was chosen to produce approximately 70% correct performance. The corners observer used four feature detectors, one for each bright corner, whereas the top-v.-bottom observer contrasted the brightness of the top and bottom pixels.\n\nThe results of using GRIFT to recover the feature detectors are displayed in Figure 2. Only the ω parameters are displayed because they are the most informative. Dark pixels indicate negative weights and bright pixels correspond to positive weights. The presence of dark and light regions in a feature detector indicates the computation of contrasts between those areas. The sign of the weights is not significant — given a fixed number of features, there are typically several equivalent sets of feature detectors that only differ from each other in the signs of their ω terms and in the associated λ and β values.\n\nBecause the optimal number of features for human subjects is unknown, GRIFT models with 1-4 features were fit to the data from each subject. The correct number of features could be determined by holding out a test set or by performing cross-validation. Simulation demonstrated that a reliable test set would need to contain nearly all of the gathered samples, and computational expense made cross-validation impractical with our current MATLAB implementation. Instead, after recovering the parameters, we estimated the mutual information between the unobserved F variables and the observed classifications C. Mutual information measures how well the feature detector outputs can predict the subject's classification decision. Unlike the log likelihood of the observations, which is dependent on the choice to model C with a logistic regression function, mutual information does not assume a particular relationship between F and C and does not necessarily increase with N. Plotting the mutual information as N increases can indicate if new detectors are making a substantial contribution or are overfitting the data. On the simulated observers' data, for which the true values of N were known, mutual information was a more accurate model selection indicator than traditional statistics such as the Bayesian or Akaike information criteria [3].\n\nFitting GRIFT to the simulated observers demonstrated that if the model is accurate, the correct features can be recovered reliably. The top-v.-bottom observer showed no substantial increase in mutual information as the number of features increased from 1 to 4. Each set of recovered feature detectors included a top-bottom contrast detector and other detectors with noisy ω_is that did not contribute much to predicting C. Although the observer truly used two detectors, one top-brighter detector and one bottom-brighter detector, the recovery of only one top-bottom contrast detector is a success because one contrast detector plus a suitable λ_0 term is logically equivalent to the original two-feature model. The corners observer showed a substantial increase in mutual information as N increased from 1 to 4 and the ω values clearly indicate four corner-sensitive feature detectors. The corners data was also tested with a five-feature GRIFT model (ω not shown) which produced four corner detectors and one feature with noisy ω_i. Its gain in mutual information was smaller than that observed on any of the previous steps. 
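The mutual-information measure itself is standard. A plug-in sketch (an illustration, assuming the joint distribution over feature configurations and responses has already been tallied from the trials) computes I(F; C) in bits:

```python
import numpy as np

def mutual_information_bits(joint):
    """joint[f, c] = P(F = f-th configuration, C = c); entries sum to 1."""
    joint = np.asarray(joint, dtype=float)
    p_f = joint.sum(axis=1, keepdims=True)   # marginal over feature configurations
    p_c = joint.sum(axis=0, keepdims=True)   # marginal over classes
    indep = p_f @ p_c                        # product of marginals
    nz = joint > 0                           # 0 * log 0 = 0 convention
    return float((joint[nz] * np.log2(joint[nz] / indep[nz])).sum())

# Feature configurations that perfectly determine the class carry 1 bit...
print(mutual_information_bits([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
# ...while feature configurations independent of the class carry none.
print(mutual_information_bits([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```

Because this quantity is bounded by the entropy of C (1 bit for a two-class task) and does not automatically grow with N, plotting it against N gives the flattening curves the text uses for model selection.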
Note that the corner areas in the ω_is recovered from the corners data are sometimes black and sometimes white. Recall that these are not image pixel values that the detectors are attempting to match, but positive and negative weights indicating that the brightness in the corner region is being contrasted to the brightness of the rest of the image.\n\nEven though the targets consisted of four bright-corner stimuli, recovering the parameters from the top-v.-bottom observer never produced ω values indicating corner-specific feature detectors. An important advantage of GRIFT over previous methods such as [4] is that targets will not "contaminate" the recovered detectors. The simulations demonstrate that the recovered detectors are determined by the classification strategy, not by the structure of the targets and classes.\n\nThe data of the four human participants revealed some interesting differences. Participants EA and RS were naive, while AC and JG were not. The largest disparity was between EA and JG. EA's data indicated no consistent pattern of mutual information increase after two features, and the two-feature model appears to contain two top-bottom contrast detectors. Therefore, it is reasonable to conclude that EA was not explicitly detecting the corners. At the other extreme is participant JG, whose data shows four very clear corner detectors and a steady increase in mutual information up to four features. Therefore, it seems very likely that this participant was matching corners and probably should be tested with a five-feature model to gain additional insight. AC and RS's data suggest three corner detectors and a top-bottom contrast detector. GRIFT's output indicates qualitative differences in the classification strategies used by the four human participants.\n\nAcross all participants, the best one-feature model was based on the contrast between the top of the image and the bottom. 
This is extremely similar to the result produced by a classification image of the data, reinforcing the strong similarity between one-feature GRIFT and that approach.\n\nIn the light-dark and faces experiments, stair-casing was used to adjust the noise level to the 71% performance level at the beginning of each session, and then the noise level was fixed for the remaining trials to improve the independence of the samples. Participants were paid and promised a $10 reward for achieving the highest score on the task.\n\nParticipants P1, P2, and P3 classified the light-dark stimuli. P1 and P2 achieved at or above the expected performance level (82% and 73% accuracy), while P3's performance was near chance (55%). Because the noise levels were fixed after the first 101 trials, a participant with good luck at the end of that period could experience very high noise levels for the remainder of the experiment, leading to poor performance. All three participants appear to have used different classification methods, providing a very informative contrast. The results of fitting the GRIFT model are in Figure 3.\n\nFigure 3: The most probable ω parameters found for the light-dark and faces experiments for different N and the mutual information between these feature detectors and the observed classifications.\n\nThe flat mutual information graph and the presence of a feature detector thresholding the overall brightness for each value of N indicate that P1 pursued a one-feature, linear-classifier strategy. P2, on the other hand, clearly employed a multi-feature, non-linear strategy. For N = 1 and N = 2, the most interpretable feature detector is an overall brightness detector, which disappears when N = 3 and the best-fit model consists of three detectors looking for specific patterns, one for each position a bright or dark spot can appear. Then when N = 4 the overall brightness detector reappears, added to the three spot detectors. Apparently the spot detectors are only effective if they are all present. With only three available detectors, the overall brightness detector is excluded, but the optimal assignment includes all four detectors. This is the best-fit model because increasing to N = 5 keeps the mutual information constant and adds a detector that is active for every stimulus. Always-active detectors function as constant additions to λ_0; therefore this is equivalent to the N = 4 solution.\n\nThe GRIFT models of participant P3 do not show a substantial increase in mutual information as the number of features rises. This lack of increase leads to the conclusion that the one-feature model is probably the best fit, and since performance was extremely low, it can be assumed that the subject was reduced to near random guessing much of the time.\n\nThe clear distinction between the results for all three subjects demonstrates the effectiveness of GRIFT and the mutual information measure in distinguishing between classification strategies.\n\nThe faces presented the largest computational challenges. The targets were two unfiltered faces from Gold et al.'s data set [6], down-sampled to 128x128. After the experiment, the stimuli were down-sampled further to 32x32 and the background surrounding the faces was removed by cropping, reducing the stimuli to 26x17. These steps made the algorithm computationally feasible, and reduced the number of parameters so they would be sufficiently constrained by the samples.\n\nThe results for three participants (P4, P5, and P6) are in Figure 3. Participants P4 and P5's data were clearly best fit by one-feature GRIFT models. Increasing the number of features simply caused the algorithm to add features that were never or always active. 
Never-active features cannot affect the classification, and, as explained previously, always-active features are also superfluous. P4's one-feature model clearly places significant weight near the eyebrows, nose, and other facial features. P5's one-feature weights are much noisier and harder to interpret. This might be related to P5's poor performance on the task — only 53% accuracy compared to P4's 72% accuracy. Perhaps the noise level was too high and P5 was guessing rather than using image information much of the time.\n\nParticipant P6's data did produce a two-feature GRIFT model, albeit one that is difficult to interpret and which only caused a small rise in mutual information. Instead of recovering independent part detectors, such as a nose detector and an eye detector, GRIFT extracted two subtly different holistic feature detectors. Given P6's poor performance (58% accuracy), these features may, like P5's results, be indicative of a guessing strategy that was not strongly influenced by the image information.\n\nThe results on faces support the hypothesis that face classification is holistic and configural, rather than the result of part classifications, especially when individual feature detection is difficult [11].\n\nAcross these experiments, the data collected were compatible with the original classification image method. In fact, the four-square human data were originally analyzed using that algorithm. 
One of the advantages of GRIFT is that it can reanalyze old data to reveal new information. In the one-feature case, GRIFT enables the use of prior probabilities on the parameters, which may improve performance when data is too scarce for the classification image approach. Most importantly, fitting multi-feature GRIFT models can reveal previously hidden non-linear classification strategies.\n\n6 Conclusion\n\nThis paper has described the GRIFT model for determining the features used in human image classification. GRIFT is an advance over previous methods that assume a single linear classifier on pixels because it describes classification as the combination of multiple independently detected features. It provides a probabilistic model of human visual classification that accounts for data and incorporates prior beliefs about the features. The feature detectors it finds are associated with the classification strategy employed by the observer and are not the result of structure in the classes' target images. GRIFT's value has been demonstrated by modeling the performance of humans on the four-square, light-dark, and faces classification tasks and by successfully recovering the parameters of computer-simulated observers in the four-square task. Its inability to find multiple local features when analyzing human performance on the faces task agrees with previous results.\n\nOne of the strengths of the graphical model approach is that it allows easy replacement of model components. An expert can easily change the prior distributions on the parameters to reflect knowledge gained in previous experiments. For example, it might be desirable to encourage the formation of edge detectors. New resolution-independent feature parameterizations can be introduced, as can transformation parameters to make the features translationally and rotationally invariant. 
If the features have explicitly parameterized locations and orientations, the model could be extended to model their joint relative positions, which might provide more information about domains such as face classification. The success of this version of GRIFT provides a firm foundation for these improvements.\n\nAcknowledgments\n\nThis research was supported by NSF Grant SES-0631602 and NIMH grant MH16745. The authors thank the reviewers, Tom Griffiths, Erik Learned-Miller, and Adam Sanborn for their suggestions.\n\nReferences\n\n[1] A.J. Ahumada, Jr. Classification image weights and internal noise level estimation. Journal of Vision, 2(1), 2002.\n\n[2] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.\n\n[3] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n\n[4] A.L. Cohen, R.M. Shiffrin, J.M. Gold, D.A. Ross, and M.G. Ross. Inducing features from visual noise. Journal of Vision, 7(8), 2007.\n\n[5] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.\n\n[6] J.M. Gold, P.J. Bennett, and A.B. Sekuler. Identification of band-pass filtered letters and faces by human and ideal observers. Vision Research, 39, 1999.\n\n[7] J.M. Gold, A.L. Cohen, and R. Shiffrin. Visual noise reveals category representations. Psychonomic Bulletin & Review, 15(4), 2006.\n\n[8] N.A. Macmillan and C.D. Creelman. Detection Theory: A User's Guide. Lawrence Erlbaum Associates, 2005.\n\n[9] S.E. Palmer. Vision Science: Photons to Phenomenology. The MIT Press, 1999.\n\n[10] D.G. Pelli, B. Farell, and D.C. Moore. The remarkable inefficiency of word recognition. Nature, 425, 2003.\n\n[11] J. Sergent. An investigation into component and configural processes underlying face perception. 
British Journal of Psychology, 75, 1984.", "award": [], "sourceid": 471, "authors": [{"given_name": "Michael", "family_name": "Ross", "institution": null}, {"given_name": "Andrew", "family_name": "Cohen", "institution": null}]}