{"title": "HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3449, "page_last": 3461, "abstract": "Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy, indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g. 250ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.", "full_text": "HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models

Sharon Zhou*, Mitchell L. Gordon*, Ranjay Krishna, Austin Narcomey, Li Fei-Fei, Michael S. Bernstein
Stanford University
{sharonz, mgord, ranjaykrishna, aon2, feifeili, msb}@cs.stanford.edu

Abstract

Generative models often use human evaluations to measure the perceived quality of their outputs.
Automated metrics are noisy, indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct HUMAN EYE PERCEPTUAL EVALUATION (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g. 250ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track the relative improvements between models, and we confirm via bootstrap sampling that these measurements are consistent and replicable.

Figure 1: Our human evaluation metric, HYPE, consistently distinguishes models from each other: here, we compare different generative models' performance on FFHQ. A score of 50% represents results indistinguishable from real, while a score above 50% represents hyper-realism.

1 Introduction

Generating realistic images is regarded as a focal task for measuring the progress of generative models. Automated metrics are either heuristic approximations [49, 52, 14, 26, 9, 45] or intractable density estimations, which have been shown to be inaccurate on high-dimensional problems [24, 7, 55].
Human evaluations, such as those given on Amazon Mechanical Turk [49, 14], remain ad-hoc because "results change drastically" [52] based on details of the task design [36, 34, 27]. With both noisy automated and noisy human benchmarks, measuring progress over time has become akin to hill-climbing on noise. Even widely used metrics, such as Inception Score [52] and Fréchet Inception Distance [23], have been discredited for their application to non-ImageNet datasets [3, 48, 8, 46]. Thus, to monitor progress, generative models need a systematic gold standard benchmark. In this paper, we introduce a gold standard benchmark for realistic generation, demonstrating its effectiveness across four datasets, six models, and two sampling techniques, and using it to assess the progress of generative models over time.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Realizing the constraints of available automated metrics, many generative modeling tasks resort to human evaluation and visual inspection [49, 52, 14]. These human measures are (1) ad-hoc, each executed idiosyncratically without proof of reliability or grounding in theory, and (2) high-variance in their estimates [52, 14, 42]. Together, these characteristics produce a lack of reliability and, downstream, (3) a lack of clear separability between models.
Theoretically, given sufficiently large sample sizes of human evaluators and model outputs, the law of large numbers would smooth out the variance and reach eventual convergence; but this would occur at (4) a high cost and a long delay.

We present HYPE (HUMAN EYE PERCEPTUAL EVALUATION) to address these criteria in turn. HYPE: (1) measures the perceptual realism of generative model outputs via a grounded method inspired by psychophysics methods in perceptual psychology, (2) is a reliable and consistent estimator, (3) is statistically separable to enable a comparative ranking, and (4) ensures a cost- and time-efficient method through modern crowdsourcing techniques such as training and aggregation. We present two methods of evaluation. The first, called HYPEtime, is inspired directly by the psychophysics literature [28, 11], and displays images using adaptive time constraints to determine the time-limited perceptual threshold a person needs to distinguish real from fake. The HYPEtime score is understood as the minimum time, in milliseconds, that a person needs to see the model's output before they can distinguish it as real or fake. For example, a score of 500ms on HYPEtime indicates that humans can distinguish model outputs from real images at 500ms exposure times or longer, but not under 500ms. The second method, called HYPE∞, is derived from the first to make it simpler, faster, and cheaper while maintaining reliability. It is interpretable as the rate at which people mistake fake and real images, given unlimited time to make their decisions. A score of 50% on HYPE∞ means that people differentiate generated results from real data at chance rate, while a score above 50% represents hyper-realism, in which generated images appear more real than real images.

We run two large-scale experiments.
First, we demonstrate HYPE's performance on unconditional human face generation using four popular generative adversarial networks (GANs) [20, 5, 25, 26] on CelebA-64 [37]. We also evaluate two newer GANs [41, 9] on FFHQ-1024 [26]. HYPE indicates that GANs have clear, measurable perceptual differences between them; this ranking is identical in both HYPEtime and HYPE∞. The best performing model, StyleGAN trained on FFHQ and sampled with the truncation trick, only performs at 27.6% HYPE∞, suggesting substantial opportunity for improvement. We can reliably reproduce these results with 95% confidence intervals using 30 human evaluators at $60 in a task that takes 10 minutes.

Second, we demonstrate the performance of HYPE∞ beyond faces on conditional generation of five object classes in ImageNet [13] and unconditional generation of CIFAR-10 [31]. Early GANs such as BEGAN are not separable in HYPE∞ when generating CIFAR-10: none of them produce results convincing to humans, verifying that this is a harder task than face generation. The newer StyleGAN shows separable improvement, indicating progress over the previous models. On ImageNet-5, GANs have improved on classes considered "easier" to generate (e.g., lemons), but score consistently low across all models on harder classes (e.g., French horns).

HYPE is a rapid solution for researchers to measure their generative models, requiring just a single click to produce reliable scores and measure progress. We deploy HYPE at https://hype.stanford.edu, where researchers can upload a model and retrieve a HYPE score. Future work will extend HYPE to additional generative tasks such as text, music, and video generation.

2 HYPE: A benchmark for HUMAN EYE PERCEPTUAL EVALUATION

HYPE displays a series of images one by one to crowdsourced evaluators on Amazon Mechanical Turk and asks the evaluators to assess whether each image is real or fake.
Half of the images are real images, drawn from the model's training set (e.g., FFHQ, CelebA, ImageNet, or CIFAR-10). The other half are drawn from the model's output. We use modern crowdsourcing training and quality control techniques [40] to ensure high-quality labels. Model creators can choose to perform two different evaluations: HYPEtime, which gathers time-limited perceptual thresholds to measure the psychometric function and report the minimum time people need to make accurate classifications, and HYPE∞, a simplified approach which assesses people's error rate under no time constraint.

Figure 2: Example images sampled with the truncation trick from StyleGAN trained on FFHQ. Images on the right exhibit the highest HYPE∞ scores, the highest human perceptual fidelity.

2.1 HYPEtime: Perceptual fidelity grounded in psychophysics

Our first method, HYPEtime, measures time-limited perceptual thresholds. It is rooted in the psychophysics literature, a field devoted to the study of how humans perceive stimuli, to evaluate human time thresholds upon perceiving an image. Our evaluation protocol follows the procedure known as the adaptive staircase method (Figure 3) [11]. An image is flashed for a limited length of time, after which the evaluator is asked to judge whether it is real or fake. If the evaluator consistently answers correctly, the staircase descends and flashes the next image with less time. If the evaluator is incorrect, the staircase ascends and provides more time.

This process requires sufficient iterations to converge to the evaluator's perceptual threshold: the shortest exposure time at which they can maintain effective performance [11, 19, 15]. The process produces what is known as the psychometric function [60], the relationship of timed stimulus exposure to accuracy.
For example, for an easily distinguishable set of generated images, a human evaluator would immediately drop to the lowest millisecond exposure.

HYPEtime displays three blocks of staircases for each evaluator. An image evaluation begins with a 3-2-1 countdown clock, each number displaying for 500ms [30]. The sampled image is then displayed for the current exposure time. Immediately after each image, four perceptual mask images are rapidly displayed for 30ms each. These noise masks are distorted to prevent retinal afterimages and further sensory processing after the image disappears [19]. We generate masks using an existing texture-synthesis algorithm [44]. Upon each submission, HYPEtime reveals to the evaluator whether they were correct.

Image exposures are in the range [100ms, 1000ms], derived from the perception literature [17]. All blocks begin at 500ms and last for 150 images (50% generated, 50% real), values empirically tuned from prior work [11, 12]. Exposure times are raised at 10ms increments and reduced at 30ms decrements, following the 3-up/1-down adaptive staircase approach, which theoretically leads to a 75% accuracy threshold that approximates the human perceptual threshold [35, 19, 11].

Figure 3: The adaptive staircase method shows images to evaluators at different time exposures, decreasing when correct and increasing when incorrect. The modal exposure measures their perceptual threshold.

Every evaluator completes multiple staircases, called blocks, on different sets of images. As a result, we observe multiple measures for the model. We employ three blocks, to balance quality estimates against evaluators' fatigue [32, 50, 22]. We average the modal exposure times across blocks to calculate a final value for each evaluator.
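The block procedure above can be simulated in a few lines. This is our own sketch, not the authors' code: the reset rule for the correct-answer streak and the clamping at the range bounds are our assumptions, since the paper does not spell them out.

```python
from statistics import mean, mode

def staircase_block(respond, n_trials=150, start=500, lo=100, hi=1000,
                    up=10, down=30, streak_needed=3):
    """Simulate one 3-up/1-down staircase block.

    `respond(exposure_ms)` returns True if the evaluator classifies the
    flashed image correctly at that exposure. Three consecutive correct
    answers lower the exposure by 30ms; any error raises it by 10ms.
    Exposures are clamped to [100ms, 1000ms].
    """
    exposure, streak, history = start, 0, []
    for _ in range(n_trials):
        history.append(exposure)
        if respond(exposure):
            streak += 1
            if streak == streak_needed:
                exposure, streak = max(lo, exposure - down), 0
        else:
            exposure, streak = min(hi, exposure + up), 0
    # The modal exposure over the block estimates the perceptual threshold.
    return mode(history)

def hype_time_score(respond, n_blocks=3):
    # An evaluator's score: mean of the modal exposures over three blocks.
    return mean(staircase_block(respond) for _ in range(n_blocks))
```

An evaluator who is only reliable above some exposure will produce a modal exposure near that threshold; a perfect discriminator bottoms out at the 100ms floor, mirroring the behavior observed for BEGAN and WGAN-GP in Experiment 1.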
Higher scores indicate a better model, whose outputs take longer exposure times to discern from real.

2.2 HYPE∞: Cost-effective approximation

Building on the previous method, we introduce HYPE∞: a simpler, faster, and cheaper method, designed by ablating HYPEtime to optimize for speed, cost, and ease of interpretation. HYPE∞ shifts from a measure of perceptual time to a measure of human deception rate, given infinite evaluation time. The HYPE∞ score gauges total error on a task of 50 fake and 50 real images (see footnotes 2 and 3), enabling the measure to capture errors on both fake and real images, and effects of hyper-realistic generation when fake images look even more realistic than real images. HYPE∞ requires fewer images than HYPEtime to find a stable value, empirically producing a 6x reduction in time and cost (10 minutes per evaluator instead of 60 minutes, at the same rate of $12 per hour). Higher scores are again better: 10% HYPE∞ indicates that only 10% of images deceive people, whereas 50% indicates that people are mistaking real and fake images at chance, rendering fake images indistinguishable from real. Scores above 50% suggest hyper-realistic images, as evaluators mistake images at a rate greater than chance.

HYPE∞ shows each evaluator a total of 100 images: 50 real and 50 fake. We calculate the proportion of images that were judged incorrectly, and aggregate the judgments over the n evaluators on k images to produce the final score for a given model.

2.3 Consistent and reliable design

To ensure that our reported scores are consistent and reliable, we need to sample sufficiently from the model as well as hire, qualify, and appropriately pay enough evaluators.

Sampling sufficient model outputs. The selection of K images to evaluate from a particular model is a critical component of a fair and useful evaluation.
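Concretely, the HYPE∞ score from Section 2.2 is an error rate aggregated over evaluators, reported overall and broken down by fake and real images. A minimal sketch of that bookkeeping (our illustration, not the authors' released code):

```python
def hype_infinity(judgments):
    """Compute the HYPE-infinity score from raw judgments.

    `judgments` is a list of per-evaluator lists of (is_fake, judged_fake)
    pairs. Returns (overall error %, error % on fakes, error % on reals);
    an error on a fake image means the evaluator was deceived.
    """
    errors = fake_errors = real_errors = n = n_fake = n_real = 0
    for evaluator in judgments:
        for is_fake, judged_fake in evaluator:
            wrong = is_fake != judged_fake
            errors += wrong
            n += 1
            if is_fake:
                n_fake += 1
                fake_errors += wrong
            else:
                n_real += 1
                real_errors += wrong
    return (100 * errors / n,
            100 * fake_errors / n_fake,
            100 * real_errors / n_real)
```

A score of 50.0 from this routine corresponds to chance-level discrimination, the paper's threshold for indistinguishability.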
We must sample a large enough number of images to fully capture a model's generative diversity, yet balance that against tractable evaluation costs. We follow existing work on evaluating generative output by sampling K = 5000 generated images from each model [52, 41, 58] and K = 5000 real images from the training set. From these samples, we randomly select images to give to each evaluator.

Quality of evaluators. To obtain a high-quality pool of evaluators, each is required to pass a qualification task. Such a pre-task filtering approach, sometimes referred to as a person-oriented strategy, is known to outperform process-oriented strategies that perform post-task data filtering or processing [40]. Our qualification task displays 100 images (50 real and 50 fake) with no time limits. Evaluators must correctly classify 65% of both real and fake images. This threshold should be treated as a hyperparameter and may change depending upon the GANs used in the tutorial and the desired discernment ability of the chosen evaluators. We choose 65% based on the cumulative binomial probability of 65 binary-choice answers out of 100 total answers: there is only a one in one-thousand chance that an evaluator will qualify by random guessing. Unlike in the task itself, fake qualification images are drawn equally from multiple different GANs to ensure an equitable qualification across all GANs. The qualification is designed to be taken occasionally, such that a pool of evaluators can assess new models on demand.

Payment. Evaluators are paid a base rate of $1 for working on the qualification task. To incentivize evaluators to remain engaged throughout the task, all further pay after the qualification comes from a bonus of $0.02 per correctly labeled image, typically totaling a wage of $12/hr.

3 Experimental setup

Datasets. We evaluate on four datasets.
(1) CelebA-64 [37] is a popular dataset for unconditional image generation with 202k images of human faces, which we align and crop to be 64 × 64 px. (2) FFHQ-1024 [26] is a newer face dataset with 70k images of size 1024 × 1024 px. (3) CIFAR-10 consists of 60k images, sized 32 × 32 px, across 10 classes. (4) ImageNet-5 is a subset of 5 classes with 6.5k images at 128 × 128 px from the ImageNet dataset [13], which have been previously identified as easy (lemon, Samoyed, library) and hard (baseball player, French horn) [9].

Footnote 2: We explicitly reveal this ratio to evaluators. Amazon Mechanical Turk forums would enable evaluators to discuss and learn about this distribution over time, thus altering how different evaluators would approach the task. By making this ratio explicit, evaluators would have the same prior entering the task.

Footnote 3: Hyper-realism is relative to the real dataset on which a model is trained. Some datasets already look less realistic because of lower resolution and/or lower diversity of images.

Architectures. We evaluate four state-of-the-art models trained on CelebA-64 and CIFAR-10: StyleGAN [26], ProGAN [25], BEGAN [5], and WGAN-GP [20]. We also evaluate two models, SN-GAN [41] and BigGAN [9], trained on ImageNet, sampling conditionally on each class in ImageNet-5. We sample BigGAN with (σ = 0.5 [9]) and without the truncation trick.

We also evaluate StyleGAN [26] trained on FFHQ-1024 with (ψ = 0.7 [26]) and without truncation-trick sampling. For parity on our best models across datasets, StyleGAN instances trained on CelebA-64 and CIFAR-10 are also sampled with the truncation trick.

We sample noise vectors from the d-dimensional spherical Gaussian noise prior z ∈ R^d ~ N(0, I) during training and test times. We specifically opted to use the same standard noise prior for comparison, yet are aware of other priors that optimize for FID and IS scores [9].
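The latent sampling setup can be sketched as follows. Note the hedge: the rejection-resampling form of truncation below matches BigGAN's formulation of the truncation trick, while StyleGAN's ψ parameter instead scales latents toward the mean, so treat this as illustrative only.

```python
import random

def sample_z(d, truncation=None, rng=random):
    """Draw z ~ N(0, I) in R^d, optionally with the truncation trick.

    With `truncation` set, each component falling outside [-t, t] is
    resampled until it lands inside (BigGAN-style truncation). Without
    it, this is the standard spherical Gaussian prior used at both
    training and test time.
    """
    z = []
    for _ in range(d):
        x = rng.gauss(0.0, 1.0)
        if truncation is not None:
            while abs(x) > truncation:
                x = rng.gauss(0.0, 1.0)
        z.append(x)
    return z
```

Truncation trades diversity for fidelity: restricting z to the high-density region of the prior yields more typical, and thus more convincing, samples, which is consistent with the truncated variants scoring higher on HYPE∞ in the experiments below.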
We select training hyperparameters published in the corresponding papers for each model.

Evaluator recruitment. We recruit 930 evaluators from Amazon Mechanical Turk, or 30 for each run of HYPE. We explain our justification for this number in the Cost tradeoffs section. To maintain a between-subjects study in this evaluation, we recruit independent evaluators across tasks and methods.

Metrics. For HYPEtime, we report the modal perceptual threshold in milliseconds. For HYPE∞, we report the error rate as a percentage of images, as well as the breakdown of this rate on real and fake images separately. To show that our results for each model are separable, we report a one-way ANOVA with Tukey pairwise post-hoc tests to compare all models.

Reliability is a critical component of HYPE, as a benchmark is not useful if a researcher receives a different score when rerunning it. We use bootstrapping [16], repeated resampling from the empirical label distribution, to measure variation in scores across multiple samples with replacement from a set of labels. We report 95% bootstrapped confidence intervals (CIs), along with the standard deviation of the bootstrap sample distribution, by randomly sampling 30 evaluators with replacement from the original set of evaluators across 10,000 iterations.

Experiment 1: We run two large-scale experiments to validate HYPE. The first focuses on the controlled evaluation and comparison of HYPEtime against HYPE∞ on established human face datasets. We recorded responses totaling (4 CelebA-64 + 2 FFHQ-1024) models × 30 evaluators × 550 responses = 99k total responses for our HYPEtime evaluation and (4 CelebA-64 + 2 FFHQ-1024) models × 30 evaluators × 100 responses = 18k for our HYPE∞ evaluation.

Experiment 2: The second experiment evaluates HYPE∞ on general image datasets.
We recorded (4 CIFAR-10 + 3 ImageNet-5) models × 30 evaluators × 100 responses = 57k total responses.

Table 1: HYPEtime on StyleGANtrunc and StyleGANno-trunc trained on FFHQ-1024.

Rank | GAN | HYPEtime (ms) | Std. | 95% CI
1 | StyleGANtrunc | 363.2 | 32.1 | 300.0 – 424.3
2 | StyleGANno-trunc | 240.7 | 29.9 | 184.7 – 302.7

4 Experiment 1: HYPEtime and HYPE∞ on human faces

We report results on HYPEtime and demonstrate that the results of HYPE∞ approximate those from HYPEtime at a fraction of the cost and time.

4.1 HYPEtime

CelebA-64. We find that StyleGANtrunc resulted in the highest HYPEtime score (modal exposure time), at a mean of 439.3ms, indicating that evaluators required nearly a half-second of exposure to accurately classify StyleGANtrunc images (Table 1). StyleGANtrunc is followed by ProGAN at 363.7ms, a 17% drop in time. BEGAN and WGAN-GP are both easily identifiable as fake, tied in last place around the minimum available exposure time of 100ms. Both BEGAN and WGAN-GP exhibit a bottoming-out effect: reaching the minimum time exposure of 100ms quickly and consistently (footnote 4).

Footnote 4: We do not pursue time exposures under 100ms due to constraints on JavaScript browser rendering times.

To demonstrate separability between models we report results from a one-way analysis of variance (ANOVA) test, where each model's input is the list of modes from each model's 30 evaluators. The ANOVA results confirm that there is a statistically significant omnibus difference (F(3, 29) = 83.5, p < 0.0001). Pairwise post-hoc analysis using Tukey tests confirms that all pairs of models are separable (all p < 0.05) except BEGAN and WGAN-GP (n.s.).

FFHQ-1024. We find that StyleGANtrunc resulted in a higher exposure time than StyleGANno-trunc, at 363.2ms and 240.7ms, respectively (Table 1).
While the 95% confidence intervals overlap by a very conservative 2.7ms, an unpaired t-test confirms that the difference between the two models is significant (t(58) = 2.3, p = 0.02).

4.2 HYPE∞

CelebA-64. Table 2 reports results for HYPE∞ on CelebA-64. We find that StyleGANtrunc resulted in the highest HYPE∞ score, fooling evaluators 50.7% of the time. StyleGANtrunc is followed by ProGAN at 40.3%, BEGAN at 10.0%, and WGAN-GP at 3.8%. No confidence intervals overlap, and an ANOVA test is significant (F(3, 29) = 404.4, p < 0.001). Pairwise post-hoc Tukey tests show that all pairs of models are separable (all p < 0.05). Notably, HYPE∞ results in separable results for BEGAN and WGAN-GP, unlike in HYPEtime where they were not separable due to a bottoming-out effect.

Table 2: HYPE∞ on four GANs trained on CelebA-64. Counterintuitively, real errors increase with the errors on fake images, because evaluators become more confused and distinguishing factors between the two distributions become harder to discern.

Rank | GAN | HYPE∞ (%) | Fakes Error | Reals Error | Std. | 95% CI | KID | FID | Precision
1 | StyleGANtrunc | 50.7% | 62.2% | 39.3% | 1.3 | 48.2 – 53.1 | 0.005 | 131.7 | 0.982
2 | ProGAN | 40.3% | 46.2% | 34.4% | 0.9 | 38.5 – 42.0 | 0.001 | 2.5 | 0.990
3 | BEGAN | 10.0% | 6.2% | 13.8% | 1.6 | 7.2 – 13.3 | 0.056 | 67.7 | 0.326
4 | WGAN-GP | 3.8% | 1.7% | 5.9% | 0.6 | 3.2 – 5.7 | 0.046 | 43.6 | 0.654

FFHQ-1024. We observe a consistently separable difference between StyleGANtrunc and StyleGANno-trunc and clear delineations between models (Table 3). HYPE∞ ranks StyleGANtrunc (27.6%) above StyleGANno-trunc (19.0%) with no overlapping CIs. Separability is confirmed by an unpaired t-test (t(58) = 8.3, p < 0.001).

Table 3: HYPE∞ on StyleGANtrunc and StyleGANno-trunc trained on FFHQ-1024. Evaluators were deceived most often by StyleGANtrunc.
Similar to CelebA-64, fake errors and real errors track each other as the line between real and fake distributions blurs.

Rank | GAN | HYPE∞ (%) | Fakes Error | Reals Error | Std. | 95% CI | KID | FID | Precision
1 | StyleGANtrunc | 27.6% | 28.4% | 26.8% | 2.4 | 22.9 – 32.4 | 0.007 | 13.8 | 0.976
2 | StyleGANno-trunc | 19.0% | 18.5% | 19.5% | 1.8 | 15.5 – 22.4 | 0.001 | 4.4 | 0.983

4.3 Cost tradeoffs with accuracy and time

One of HYPE's goals is to be cost and time efficient. When running HYPE, there is an inherent tradeoff between accuracy and time, as well as between accuracy and cost. This is driven by the law of large numbers: recruiting additional evaluators in a crowdsourcing task often produces more consistent results, but at a higher cost (as each evaluator is paid for their work) and a longer amount of time until completion (as more evaluators must be recruited and they must complete their work).

To manage this tradeoff, we run an experiment with HYPE∞ on StyleGANtrunc. We completed an additional evaluation with 60 evaluators, and compute 95% bootstrapped confidence intervals, choosing from 10 to 120 evaluators (Figure 4). We see that the CI begins to converge around 30 evaluators, our recommended number of evaluators to recruit.

Payment to evaluators was calculated as described in the Approach section. At 30 evaluators, the cost of running HYPEtime on one model was approximately $360, while the cost of running HYPE∞ on the same model was approximately $60. Payment per evaluator for both tasks was approximately $12/hr. Evaluators spent an average of one hour each on a HYPEtime task and 10 minutes each on a HYPE∞ task.
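The bootstrapped CIs used here and in the Metrics section can be reproduced with a short routine; taking percentile endpoints off the sorted bootstrap distribution is our assumption for how the interval bounds are read, since the paper does not specify the estimator.

```python
import random
from statistics import mean, stdev

def bootstrap_ci(evaluator_scores, n_iter=10000, alpha=0.05, seed=0):
    """95% bootstrap CI over per-evaluator HYPE scores.

    Resamples the evaluators with replacement `n_iter` times, computes
    the mean score of each resample, and returns the (lower, upper)
    percentile bounds plus the standard deviation of the bootstrap
    distribution, mirroring what the paper reports alongside each score.
    """
    rng = random.Random(seed)
    k = len(evaluator_scores)
    means = sorted(mean(rng.choices(evaluator_scores, k=k))
                   for _ in range(n_iter))
    lo = means[int((alpha / 2) * n_iter)]
    hi = means[int((1 - alpha / 2) * n_iter) - 1]
    return lo, hi, stdev(means)
```

Widening or narrowing of this interval as the evaluator pool grows is exactly the convergence behavior plotted in Figure 4.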
Thus, HYPE∞ achieves its goal of being significantly cheaper to run, while maintaining consistency.

4.4 Comparison to automated metrics

As FID [23] is one of the most frequently used evaluation methods for unconditional image generation, it is imperative to compare HYPE against FID on the same models. We also compare to two newer automated metrics: KID [6], an unbiased estimator independent of sample size, and F1/8 (precision) [51], which captures fidelity independently. We show through Spearman rank-order correlation coefficients that HYPE scores are not correlated with FID (ρ = −0.029, p = 0.96), where a Spearman correlation of −1.0 is ideal because lower FID and higher HYPE scores indicate stronger models. We therefore find that FID is not highly correlated with human judgment. Meanwhile, HYPEtime and HYPE∞ exhibit strong correlation (ρ = 1.0, p = 0.0), where 1.0 is ideal because they are directly related. We calculate FID across the standard protocol of 50K generated and 50K real images for both CelebA-64 and FFHQ-1024, reproducing scores for StyleGANno-trunc. KID (ρ = −0.609, p = 0.20) and precision (ρ = 0.657, p = 0.16) both show a statistically insignificant but medium level of correlation with humans.

Figure 4: Effect of more evaluators on CI.

4.5 HYPE∞ during model training

HYPE can also be used to evaluate progress during model training. We find that HYPE∞ scores increased as StyleGAN training progressed: from 29.5% at 4k epochs, to 45.9% at 9k epochs, to 50.3% at 25k epochs (F(2, 29) = 63.3, p < 0.001).

5 Experiment 2: HYPE∞ beyond faces

We now turn to another popular image generation task: objects.
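The Spearman rank-order correlations used in the comparisons above, and again in Experiment 2, can be computed without external dependencies. A sketch that assigns average ranks to ties and omits the p-value:

```python
def spearman(xs, ys):
    """Spearman rank-order correlation between two score lists.

    Ranks each list (average rank for tie groups), then computes the
    Pearson correlation of the ranks. Returns a value in [-1, 1].
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # Extend j over the run of tied values starting at i.
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

For example, a strong model ranking agreement such as that between HYPEtime and HYPE∞ yields a coefficient of 1.0, while HYPE against FID should trend toward −1.0 if FID tracked human judgment (lower FID is better, higher HYPE is better).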
As Experiment 1 showed HYPE∞ to be an efficient and cost-effective variant of HYPEtime, here we focus exclusively on HYPE∞.

5.1 ImageNet-5

We evaluate conditional image generation on five ImageNet classes (Table 4). We also report FID [23], KID [6], and F1/8 (precision) [51] scores. To evaluate the relative effectiveness of the three GANs within each object class, we compute five one-way ANOVAs, one for each of the object classes. We find that the HYPE∞ scores are separable for images from three easy classes: Samoyeds (dogs) (F(2, 29) = 15.0, p < 0.001), lemons (F(2, 29) = 4.2, p = 0.017), and libraries (F(2, 29) = 4.9, p = 0.009). Pairwise post-hoc tests reveal that this difference is only significant between SN-GAN and the two BigGAN variants. We also observe that models have unequal strengths, e.g. SN-GAN is better suited to generating libraries than Samoyeds.

Comparison to automated metrics. Spearman rank-order correlation coefficients on all three GANs across all five classes show that there is a low to moderate correlation between the HYPE∞ scores and KID (ρ = −0.377, p = 0.02) and FID (ρ = −0.282, p = 0.01), and negligible correlation with precision (ρ = −0.067, p = 0.81). Some correlation for our ImageNet-5 task is expected, as these metrics use pretrained ImageNet embeddings to measure differences between generated and real data. Interestingly, we find that this correlation depends upon the GAN: considering only SN-GAN, we find stronger coefficients for KID (ρ = −0.500, p = 0.39), FID (ρ = −0.300, p = 0.62), and precision (ρ = −0.205, p = 0.74). When considering only BigGAN, we find far weaker coefficients for KID (ρ = −0.151, p = 0.68), FID (ρ = −0.067, p = 0.85), and precision (ρ = −0.164, p = 0.65).
This illustrates an important flaw with these automatic metrics: their ability to correlate with human judgment depends upon the generative model that the metrics are evaluating, varying by model and by task.

Table 4: HYPE∞ on three models trained on ImageNet and conditionally sampled on five classes. BigGAN routinely outperforms SN-GAN. BigGANtrunc and BigGANno-trunc are not separable.

Difficulty | Class | GAN | HYPE∞ (%) | Fakes Error | Reals Error | Std. | 95% CI | KID | FID | Precision
Easy | Lemon | BigGANtrunc | 18.4% | 21.9% | 14.9% | 2.3 | 14.2 – 23.1 | 0.043 | 94.22 | 0.784
Easy | Lemon | BigGANno-trunc | 20.2% | 22.2% | 18.1% | 2.2 | 16.0 – 24.8 | 0.036 | 87.54 | 0.774
Easy | Lemon | SN-GAN | 12.0% | 10.8% | 13.3% | 1.6 | 9.0 – 15.3 | 0.053 | 117.90 | 0.656
Easy | Samoyed | BigGANtrunc | 19.9% | 23.5% | 16.2% | 2.6 | 15.0 – 25.1 | 0.027 | 56.94 | 0.794
Easy | Samoyed | BigGANno-trunc | 19.7% | 23.2% | 16.1% | 2.2 | 15.5 – 24.1 | 0.014 | 46.14 | 0.906
Easy | Samoyed | SN-GAN | 5.8% | 3.4% | 8.2% | 0.9 | 4.1 – 7.8 | 0.046 | 88.68 | 0.785
Easy | Library | BigGANtrunc | 17.4% | 22.0% | 12.8% | 2.1 | 13.3 – 21.6 | 0.049 | 98.45 | 0.695
Easy | Library | BigGANno-trunc | 22.9% | 28.1% | 17.6% | 2.1 | 18.9 – 27.2 | 0.029 | 78.49 | 0.814
Easy | Library | SN-GAN | 13.6% | 15.1% | 12.1% | 1.9 | 10.0 – 17.5 | 0.043 | 94.89 | 0.814
Hard | French Horn | BigGANtrunc | 7.3% | 9.0% | 5.5% | 1.8 | 4.0 – 11.2 | 0.031 | 78.21 | 0.732
Hard | French Horn | BigGANno-trunc | 6.9% | 8.6% | 5.2% | 1.4 | 4.3 – 9.9 | 0.042 | 96.18 | 0.757
Hard | French Horn | SN-GAN | 3.6% | 5.0% | 2.2% | 1.0 | 1.8 – 5.9 | 0.156 | 196.12 | 0.674
Hard | Baseball Player | BigGANtrunc | 1.9% | 1.9% | 1.9% | 0.7 | 0.8 – 3.5 | 0.049 | 91.31 | 0.853
Hard | Baseball Player | BigGANno-trunc | 2.2% | 3.3% | 1.2% | 0.6 | 1.3 – 3.5 | 0.026 | 76.71 | 0.838
Hard | Baseball Player | SN-GAN | 2.8% | 3.6% | 1.9% | 1.5 | 0.8 – 6.2 | 0.052 | 105.82 | 0.785

Table 5: Four models on CIFAR-10.
StyleGANtrunc can generate realistic images from CIFAR-10.

GAN | HYPE∞ (%) | Fakes Error | Reals Error | Std. | 95% CI | KID | FID | Precision
StyleGANtrunc | 23.3% | 28.2% | 18.5% | 1.6 | 20.1 – 26.4 | 0.005 | 62.9 | 0.982
ProGAN | 14.8% | 18.5% | 11.0% | 1.6 | 11.9 – 18.0 | 0.001 | 53.2 | 0.990
BEGAN | 14.5% | 14.6% | 14.5% | 1.7 | 11.3 – 18.1 | 0.056 | 96.2 | 0.326
WGAN-GP | 13.2% | 15.3% | 11.1% | 2.3 | 9.1 – 18.1 | 0.046 | 104.0 | 0.654

5.2 CIFAR-10

For the difficult task of unconditional generation on CIFAR-10, we use the same four model architectures as in Experiment 1: CelebA-64. Table 5 shows that HYPE∞ was able to separate StyleGANtrunc from the earlier BEGAN, WGAN-GP, and ProGAN, indicating that StyleGAN is the first among them to make human-perceptible progress on unconditional object generation with CIFAR-10.

Comparison to automated metrics. Spearman rank-order correlation coefficients on all four GANs show medium, yet statistically insignificant, correlations with KID (ρ = −0.600, p = 0.40), FID (ρ = 0.600, p = 0.40), and precision (ρ = −0.800, p = 0.20).

6 Related work

Cognitive psychology. We leverage decades of cognitive psychology to motivate how we use stimulus timing to gauge the perceptual realism of generated images. It takes an average of 150ms of focused visual attention for people to process and interpret an image, but only 120ms to respond to faces because our inferotemporal cortex has dedicated neural resources for face detection [47, 10]. Perceptual masks are placed between a person's response to a stimulus and their perception of it to eliminate post-processing of the stimuli after the desired time exposure [53]. Prior work in determining human perceptual thresholds [19] generates masks from test images using the texture-synthesis algorithm [44].
We leverage this literature to establish feasible lower bounds on the exposure time of images, the time between images, and the use of noise masks.

Success of automatic metrics. Common generative modeling tasks include realistic image generation [18], machine translation [1], image captioning [57], and abstractive summarization [39], among others. These tasks often resort to automatic metrics like the Inception Score (IS) [52] and Fréchet Inception Distance (FID) [23] to evaluate images, and BLEU [43], CIDEr [56], and METEOR [2] scores to evaluate text. While we focus on how realistic generated content appears, other automatic metrics also measure diversity of output, overfitting, entanglement, training stability, and computational and sample efficiency of the model [8, 38, 3]. Our metric may also capture one aspect of output diversity, insofar as human evaluators can detect similarities or patterns across images. Our evaluation is not meant to replace existing methods but to complement them.

Limitations of automatic metrics. Prior work has asserted a coarse correlation between human judgment and FID [23] and IS [52], leading to their widespread adoption. Both metrics depend on the Inception-v3 network [54], a pretrained ImageNet model, to calculate statistics on the generated output (for IS) and on the real and generated distributions (for FID). The validity of these metrics when applied to other datasets has been repeatedly called into question [3, 48, 8, 46]. Perturbations imperceptible to humans alter their values, similar to the behavior of adversarial examples [33].
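For concreteness, FID is the Fréchet distance between two Gaussians fitted to Inception-v3 embeddings of the real and generated sets. A minimal numpy sketch, assuming the embeddings have already been extracted (the Inception network itself is omitted):

```python
import numpy as np

def frechet_inception_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    real_feats, gen_feats: arrays of shape (n_images, n_features),
    assumed to be precomputed Inception-v3 activations.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_r @ cov_g (real and nonnegative for PSD inputs);
    # clip tiny negative values from numerical error before the sqrt.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

Because the score is computed from a sample of embeddings, re-running it over different subsets of images directly exposes the sampling variance of the metric.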
Finally, similar to our metric, FID depends on a set of real examples and a set of generated examples to compute high-level differences between the distributions, so it carries inherent variance that depends on how many images are used and which ones are chosen. In fact, there is a correlation between accuracy and budget (cost of computation) when improving FID scores: spending more compute time, at higher cost, yields better FID scores [38]. Nevertheless, this per-image cost is still lower than that of paid human annotators.

Human evaluations. Many human-based evaluations have been attempted with varying degrees of success in prior work, either to evaluate models directly [14, 42] or to motivate the use of automated metrics [52, 23]. Prior work has also used people to evaluate GAN outputs on CIFAR-10 and MNIST, even providing immediate feedback after every judgment [52]. That work found that generated MNIST samples saturate human performance: people cannot distinguish generated digits from real MNIST digits, while the same model still produces a 21.3% error rate on CIFAR-10 [52]. This suggests that different datasets pose different levels of difficulty for crossing realistic or hyper-realistic thresholds. The closest recent work to ours compares models using a tournament of discriminators [42]. However, that comparison was not yet rigorously evaluated on humans, nor were human discriminators presented experimentally. The framework we present would enable such a tournament evaluation to be performed reliably and easily.

7 Discussion and conclusion

Envisioned Use. 
We created HYPE as a turnkey solution for human evaluation of generative models. Researchers can upload their model, receive a score, and compare progress via our online deployment. During periods of high usage, such as competitions, a retainer model [4] enables evaluation using HYPE∞ in 10 minutes, instead of the default 30 minutes.

Limitations. Extensions of HYPE may require different task designs. In the case of text generation (translation, caption generation), HYPEtime will require much longer exposures and a much wider adjustment range for the perceptual time thresholds [29, 59]. Beyond realism, other metrics such as diversity, overfitting, entanglement, training stability, and computational and sample efficiency are additional benchmarks that could be incorporated, but they are outside the scope of this paper. Some may be better suited to fully automated evaluation [8, 38]. Similar to related work in evaluating text generation [21], we suggest that diversity can be incorporated using the automated recall score, which measures diversity independently from precision (F1/8) [51].

Conclusion. HYPE provides two human evaluation benchmarks for generative models that (1) are grounded in psychophysics, (2) provide task designs that produce reliable results, (3) separate model performance, and (4) are cost and time efficient. We introduce two benchmarks: HYPEtime, which uses perceptual time thresholds, and HYPE∞, which reports the error rate sans time constraints. We demonstrate the efficacy of our approach on image generation across six models {StyleGAN, SN-GAN, BigGAN, ProGAN, BEGAN, WGAN-GP}, four image datasets {CelebA-64, FFHQ-1024, CIFAR-10, ImageNet-5}, and two types of sampling methods {with, without the truncation trick}.

Acknowledgements

We thank Kamyar Azizzadenesheli, Tatsu Hashimoto, and Maneesh Agrawala for insightful conversations and support. 
We also thank Durim Morina and Gabby Wright for their contributions to the HYPE system and website. M.L.G. was supported by a Junglee Corporation Stanford Graduate Fellowship. This work was supported in part by an Alfred P. Sloan fellowship. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

References

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[2] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005.

[3] Shane Barratt and Rishi Sharma. A note on the Inception Score. arXiv preprint arXiv:1801.01973, 2018.

[4] Michael S. Bernstein, Joel Brandt, Robert C. Miller, and David R. Karger. Crowds in two seconds: Enabling realtime crowd-powered interfaces. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pp. 33–42. ACM, 2011.

[5] David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

[6] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.

[7] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[8] Ali Borji. Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding, 2018.

[9] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[10] Rama Chellappa, Pawan Sinha, and P. Jonathon Phillips. Face recognition by computers and humans. Computer, 43(2):46–55, 2010.

[11] Tom N. Cornsweet. The staircase-method in psychophysics. 1962.

[12] Steven C. Dakin and Diana Omigie. Psychophysical evidence for a non-linear representation of facial identity. Vision Research, 49(18):2285–2296, 2009.

[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

[14] Emily L. Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pp. 1486–1494, 2015.

[15] Li Fei-Fei, Asha Iyer, Christof Koch, and Pietro Perona. What do we perceive in a glance of a real-world scene? Journal of Vision, 7(1):10–10, 2007.

[16] Joseph Felsenstein. Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39(4):783–791, 1985.

[17] Paul Fraisse. Perception and estimation of time. Annual Review of Psychology, 35(1):1–37, 1984.

[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

[19] Michelle R. Greene and Aude Oliva. The briefest of glances: The time course of natural scene understanding. Psychological Science, 20(4):464–472, 2009.

[20] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

[21] Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792, 2019.

[22] Kenji Hata, Ranjay Krishna, Li Fei-Fei, and Michael S. Bernstein. A glimpse far into the future: Understanding long-term crowd worker quality. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pp. 889–901. ACM, 2017.

[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

[24] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[25] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[26] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

[27] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 453–456. ACM, 2008.

[28] Stanley A. Klein. Measuring, estimating, and understanding the psychometric function: A commentary. Perception & Psychophysics, 63(8):1421–1455, 2001.

[29] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 706–715, 2017.

[30] Ranjay A. Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A. Shamma, Li Fei-Fei, and Michael S. Bernstein. Embracing error to enable rapid crowdsourcing. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 3167–3179. ACM, 2016.

[31] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[32] Gerald P. Krueger. Sustained work, fatigue, sleep loss and performance: A review of the issues. Work & Stress, 3(2):129–141, 1989.

[33] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[34] John Le, Andy Edmonds, Vaughn Hester, and Lukas Biewald. Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution. In SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation, volume 2126, pp. 22–32, 2010.

[35] H. Levitt. Transformed up-down methods in psychoacoustics. The Journal of the Acoustical Society of America, 49(2B):467–477, 1971.

[36] Angli Liu, Stephen Soderland, Jonathan Bragg, Christopher H. Lin, Xiao Ling, and Daniel S. Weld. Effective crowd annotation for relation extraction. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 897–906, 2016.

[37] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[38] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pp. 698–707, 2018.

[39] Inderjeet Mani. Advances in Automatic Text Summarization. MIT Press, 1999.

[40] Tanushree Mitra, Clayton J. Hutto, and Eric Gilbert. Comparing person- and process-centric strategies for obtaining quality data on Amazon Mechanical Turk. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1345–1354. ACM, 2015.

[41] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[42] Catherine Olsson, Surya Bhupatiraju, Tom Brown, Augustus Odena, and Ian Goodfellow. Skill rating for generative models. arXiv preprint arXiv:1808.04888, 2018.

[43] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.

[44] Javier Portilla and Eero P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–70, 2000.

[45] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[46] Suman Ravuri, Shakir Mohamed, Mihaela Rosca, and Oriol Vinyals. Learning implicit generative models with the method of learned moments. arXiv preprint arXiv:1806.11006, 2018.

[47] Keith Rayner, Tim J. Smith, George L. Malcolm, and John M. Henderson. Eye movements and visual encoding during scene perception. Psychological Science, 20(1):6–10, 2009.

[48] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.

[49] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. arXiv preprint arXiv:1901.08971, 2019.

[50] Jeffrey M. Rzeszotarski, Ed Chi, Praveen Paritosh, and Peng Dai. Inserting micro-breaks into crowdsourcing workflows. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.

[51] Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pp. 5228–5237, 2018.

[52] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

[53] George Sperling. A model for visual memory tasks. Human Factors, 5(1):19–31, 1963.

[54] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

[55] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[56] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575, 2015.

[57] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164, 2015.

[58] David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. 2016.

[59] Daniel S. Weld, Christopher H. Lin, and Jonathan Bragg. Artificial intelligence and collective intelligence. Handbook of Collective Intelligence, pp. 89–114, 2015.

[60] Felix A. Wichmann and N. Jeremy Hill. The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63(8):1293–1313, 2001.