{"title": "Classification Accuracy Score for Conditional Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 12268, "page_last": 12279, "abstract": "Deep generative models (DGMs) of images are now sufficiently mature that they produce nearly photorealistic samples and obtain scores similar to the data distribution on heuristics such as Frechet Inception Distance (FID). These results, especially on large-scale datasets such as ImageNet, suggest that DGMs are learning the data distribution in a perceptually meaningful space and can be used in downstream tasks. To test this latter hypothesis, we use class-conditional generative models from a number of model classes\u2014variational autoencoders, autoregressive models, and generative adversarial networks (GANs)\u2014to infer the class labels of real data. We perform this inference by training an image classifier using only synthetic data and using the classifier to predict labels on real data. The performance on this task, which we call Classification Accuracy Score (CAS), reveals some surprising results not identified by traditional metrics; these results constitute our contributions. First, when using a state-of-the-art GAN (BigGAN-deep), Top-1 and Top-5 accuracy decrease by 41.6% and 27.9%, respectively, compared to the original data; and conditional generative models from other model classes, such as Vector-Quantized Variational Autoencoder-2 (VQ-VAE-2) and Hierarchical Autoregressive Models (HAMs), substantially outperform GANs on this benchmark. Second, CAS automatically surfaces particular classes for which generative models fail to capture the data distribution; these failures were previously unknown in the literature. Third, we find traditional GAN metrics such as Inception Score (IS) and FID to be neither predictive of CAS nor useful when evaluating non-GAN models.
Furthermore, in order to facilitate better diagnoses of generative models, we open-source the proposed metric.", "full_text": "Classification Accuracy Score for Conditional Generative Models

Suman Ravuri & Oriol Vinyals\u2217
DeepMind
London, UK N1C 4AG
ravuris,vinyals@google.com

Abstract

Deep generative models (DGMs) of images are now sufficiently mature that they produce nearly photorealistic samples and obtain scores similar to the data distribution on heuristics such as Frechet Inception Distance (FID). These results, especially on large-scale datasets such as ImageNet, suggest that DGMs are learning the data distribution in a perceptually meaningful space and can be used in downstream tasks. To test this latter hypothesis, we use class-conditional generative models from a number of model classes\u2014variational autoencoders, autoregressive models, and generative adversarial networks (GANs)\u2014to infer the class labels of real data. We perform this inference by training an image classifier using only synthetic data and using the classifier to predict labels on real data. The performance on this task, which we call Classification Accuracy Score (CAS), reveals some surprising results not identified by traditional metrics; these results constitute our contributions. First, when using a state-of-the-art GAN (BigGAN-deep), Top-1 and Top-5 accuracy decrease by 41.6% and 27.9%, respectively, compared to the original data; and conditional generative models from other model classes, such as Vector-Quantized Variational Autoencoder-2 (VQ-VAE-2) and Hierarchical Autoregressive Models (HAMs), substantially outperform GANs on this benchmark. Second, CAS automatically surfaces particular classes for which generative models fail to capture the data distribution; these failures were previously unknown in the literature. Third, we find traditional GAN metrics such as Inception Score (IS) and FID to be neither predictive of CAS nor useful when evaluating non-GAN models. Furthermore, in order to facilitate better diagnoses of generative models, we open-source the proposed metric.

1 Introduction

Evaluating generative models of high-dimensional data remains an open problem. Despite a number of subtleties in generative model assessment [1], in a quest to improve
generative models of images, researchers, and particularly those who have focused on Generative Adversarial Networks [2], have identified desirable properties such as \u201csample quality\u201d and \u201cdiversity\u201d and proposed automatic metrics to measure these desiderata. As a result, recent years have witnessed a rapid improvement in the quality of deep generative models. While ultimately the utility of these models is their performance in downstream tasks, the focus on these metrics has led to models whose samplers now generate nearly photorealistic images [3\u20135]. For one model in particular, BigGAN-deep [3], results on standard GAN metrics such as Inception Score (IS) [6] and Frechet Inception Distance (FID) [7] approach those of the data distribution. The results on FID, which purports to be the Wasserstein-2 metric in a perceptual feature space, in particular suggest that BigGANs are capturing the data distribution.

\u2217Corresponding author: Suman Ravuri (ravuris@google.com).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: CAS identifies classes for which BigGAN-deep fails to capture the data distribution. Top row are real images, and the bottom two rows are samples from BigGAN-deep. (Panels: balloon, paddle wheel, pencil sharpener, spatula.)

A similar, though less heralded, improvement has occurred for models whose objectives are (bounds of) likelihood, with the result that many of these models now also produce photorealistic samples. Examples include: Subscale Pixel Networks [8], unconditional autoregressive models of 128\u00d7128 ImageNet that achieve state-of-the-art test set log-likelihoods; Hierarchical Autoregressive Models (HAMs) [9], class-conditional autoregressive models of 128\u00d7128 and 256\u00d7256 ImageNet; and the recently introduced Vector-Quantized Variational Autoencoder-2 (VQ-VAE-2) [10], a variational autoencoder that uses vector quantization and an autoregressive prior to produce high-quality samples. Notably, these models measure diversity using test set likelihood and assess sample quality through visual inspection, eschewing the metrics typically used in GAN research.

As these models increasingly seem \u201cto learn the distribution\u201d according to these metrics, it is natural to consider their use in downstream tasks. Such a view certainly has a precedent: improved test set likelihoods in language models, unconditional models of text, also improve performance in tasks such as speech recognition [11]. While a generative model need not learn the data distribution to perform well on a downstream task, poor performance on such tasks allows us to diagnose specific problems with both our generative models and the task-agnostic metrics we use to evaluate them. To that end, we use a general framework (first posed in [12] and further studied in [13]) in which we use conditional generative models to perform approximate inference and measure the quality of that inference. The idea is simple: for any generative model of the form p\u03b8(x|y), we learn an inference network \u02c6p(y|x) using only samples from the conditional generative model and measure the performance of the inference network on a downstream task. We then compare performance to that of an inference network trained on real data. We apply this framework to conditional image models where y is the image label, x is the image, and the task is image classification. (N.B. this approach has been used for evaluating smaller scale GANs [12\u201316].) The performance measures we use, Top-1 and Top-5 accuracy, denote a Classification Accuracy Score (CAS). The gap in performance between networks trained on real and synthetic data allows us to understand specific deficiencies in the generative model. Although a simple metric, CAS reveals some surprising results:

\u2022 When using a state-of-the-art GAN (BigGAN-deep) and an off-the-shelf ResNet-50 classifier as the inference network, we found that Top-1 and Top-5 accuracies decrease by 41.6% and 27.9%, respectively, compared to using real data.

\u2022 Conditional generative models based on likelihood, such as VQ-VAE-2 and HAM, perform well compared to BigGAN-deep, despite achieving relatively poor Inception Scores and Frechet Inception Distances. Since these models produce visually appealing samples, the result suggests that IS and FID are poor measures of non-GAN model performance.

\u2022 CAS automatically surfaces particular classes for which BigGAN-deep and VQ-VAE-2 fail to capture the data distribution and which were previously unknown in the literature. Figure 1 shows four such classes for BigGAN-deep.

\u2022 We find that neither IS, nor FID, nor combinations thereof are predictive of CAS. As generative models may soon be deployed in downstream tasks, these results suggest that we should create metrics that better measure task performance.

\u2022 We calculate a Naive Augmentation Score (NAS), a variant of CAS where the image classifier is trained on both real and synthetic images, to demonstrate that classification performance improves in limited circumstances. Augmenting the ImageNet training set with low-diversity BigGAN-deep images improves Top-5 accuracy by 0.2%, while augmenting the dataset with any other synthetic images degrades classification performance.

In Section 2 we provide a few definitions, desiderata of metrics, and shortcomings of the most popular metrics in relation to different research directions for generative modeling. Section 3 defines CAS. Finally, Section 4 provides a large-scale study of current state-of-the-art generative models using FID, IS, and CAS on both the ImageNet and CIFAR-10 datasets.

2 Metrics for Generative Models

Much of the difficulty in evaluating any generative model is not knowing the task for which the model will be used. Understanding how the model will be deployed, however, has important implications on its desired properties. For example, consider the seemingly similar tasks of automatic speech recognition and speech synthesis. While both tasks may share the same generative model of speech\u2014such as a hidden Markov model p\u03b8(o,l) with the observed and latent variables being the waveform o and word sequence l, respectively\u2014the implications of model misspecification are vastly different. In speech recognition, the model should be able to infer words for all possible speech waveforms, even if the waveforms themselves are degraded. In speech synthesis, however, the model should produce the most realistic-sounding samples, even if it cannot produce all possible speech waveforms. In particular, for automatic speech recognition, we care about p\u03b8(l|o), while for speech synthesis, we care about o \u223c p\u03b8(o|l). In the absence of a known downstream task, we assess to what extent the model distribution p\u03b8(x) matches the data distribution pdata(x), a less specific and often more difficult goal. Two consequences of the trivial observation that p\u03b8(x) = pdata(x) are: 1) each sample x \u223c p\u03b8(x) \u201ccomes\u201d from the data distribution (i.e., it is a \u201cplausible\u201d sample from the data distribution), and 2) all possible examples from the data distribution are represented by the model. Different metrics that evaluate the degree of model mismatch weigh these criteria differently. Furthermore, we expect our metrics to be relatively fast to calculate. This last desideratum often depends on the model class. The most popular classes seem to be:

\u2022 (Inexact) likelihood models using variational inference (e.g., VAE [17, 18])
\u2022 Likelihood using autoregressive models (e.g., PixelCNN [19])
\u2022 Likelihood models based on bijections (e.g., GLOW [20], rNVP [21])
\u2022 (Possibly inexact) likelihood using energy-based models (e.g., RBM [22])
\u2022 Implicit generative models (e.g., GANs)

For the first four of these classes, the likelihood objective provides us scaled estimates of the KL-divergence between the data and model. Furthermore, test set likelihood is also an implicit measure of diversity. The likelihood, however, is a fairly poor measure of sample quality [1] and often scores out-of-domain data more highly than in-domain data [23]. For implicit models, the objective provides neither an accurate estimate of a statistical divergence or distance nor a natural evaluation metric. The lack of any such metrics likely forced researchers to propose heuristics that measure versions of both 1 and 2 (sample quality and diversity) simultaneously. Inception Score (IS) [6], exp(E_x[KL(p(y|x) \u2016 p(y))]), measures 1 by how confidently a classifier assigns an image to a particular class (p(y|x)), and 2 by penalizing if too many images were classified to the same class (p(y)). More principled versions of this procedure are Frechet Inception Distance (FID) [7] and Kernel Inception Distance (KID) [24], which both use variants of two-sample tests in a learned \u201cperceptual\u201d feature space, the Inception pool3 space, to assess distribution matching. Even though this space was an ad-hoc proposition, recent work [25] suggests that deep features correlate with human perception of similarity. Even more recent work [26, 27] measures 1 and 2 independently by calculating precision and recall.

Reliance on IS and FID in particular has led to improvements in GAN models but has certain deficiencies. IS does not penalize a lack of intra-class diversity, and certain out-of-distribution samples produce Inception Scores three times higher than that of the data [28]. FID, on the other hand, suffers from a high degree of bias [24]. Moreover, the pool3 feature layer may not even correlate well with human judgment of sample quality [29]. In this work, we also find that non-GAN models have rather poor Inception Scores and Frechet Inception Distances, even though the samples are visually appealing.

Rather than creating ad-hoc heuristics aimed at broadly measuring sample quality and diversity, we instead evaluate generative models by assessing their performance on a downstream task. This is akin to measuring a generative model of speech by evaluating it on automatic speech recognition. Since the models considered here are implicit or do not admit exact likelihoods, exact inference is difficult. To circumvent this issue, we train an inference network on samples from the model. If the generative model is indeed capturing the data distribution, then we could replace the original distribution with a model-generated one, perform any downstream task, and obtain the same result. In this work, we study perhaps the simplest downstream task: image classification.

This idea is not necessarily new: for GAN evaluation, it has been independently discovered at least four times. [12] first introduced the metric (denoted \u201cadversarial accuracy\u201d) to measure their proposed Layer-Recursive GAN and connected image classification to approximate inference. [13] more systematically studied this idea of approximate inference to measure the boundary distortion induced by GANs. They did this by training separate per-label unconditional generative models, and then trained classifiers on synthetic data to understand how the boundary shifted and to measure the sample diversity of GANs. Predating [13], [14] used \u201cTrain on Synthetic, Test on Real\u201d to measure a recurrent conditional GAN for medical data. [16] trained on synthetic data and tested on real (denoted \u201cGAN-train\u201d) as an approximate recall metric for GANs. They also trained on real data and tested on synthetic (denoted \u201cGAN-test\u201d) as an approximate precision test. Unlike previous work, they tested on larger datasets such as 128\u00d7128 ImageNet, but with smaller scale models such as SNGAN [30].

The metrics mentioned above are by no means the only ones, and researchers have proposed methods to evaluate other properties of generative models. [31] constructs approximate manifolds from data and samples, and applies the method to GAN samples to determine whether mode collapse occurred. [32] attempts to determine the support size of GANs by using a Birthday Paradox test, though the procedure requires a human to identify two nearly-identical samples. Maximum Mean Discrepancy [33] is a two-sample test that has many nice theoretical properties but seems to be less used because the choice of kernels does not necessarily coincide with human judgment. Procedurally similar to our method, [34] proposes a \u201creverse LM score\u201d, which trains a language model on GAN data and tests on a real held-out set. [35] measures the quality of generative models of text by training a sentiment analysis classifier. Finally, [36] classifies real data using a student network mimicking a teacher network pretrained on real data but distilled on GAN data.

Our work most closely mirrors [16], but differs in some key respects. First, since we view image classification as approximate inference, we are able to describe its limitations in Section 3 and verify the approximation in Section 4.5. Second, while in [16] performance on GAN-train correlates with improved IS and FID, we focus more on large-scale and non-GAN models, such as VQ-VAE-2 and HAMs, where FID and IS are not indicative of classification performance. Third, by polling the inference network, we can identify classes for which the model failed to capture the data distribution. Finally, we open-source the metric for ImageNet for ease of evaluating large-scale generative models.

3 Classification Accuracy Score

At the heart of CAS lies a very simple idea: if the model captures the data distribution, performance on any downstream task should be similar whether using the original or model data. To make this intuition more precise, suppose that data comes from a distribution p(x,y), the task is to infer y from x, and we suffer a loss L(y, \u02c6y) for predicting \u02c6y when the true label is y.
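This train-on-synthetic, test-on-real procedure can be made concrete with a minimal sketch (a toy illustration only, not the paper's setup: two-dimensional class-conditional Gaussians stand in for the generative model p\u03b8(x|y), and a nearest-centroid classifier stands in for the ResNet inference network; all names here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(means, n_per_class, noise=1.0):
    """Draw labelled samples; class-conditional Gaussians stand in for p(x|y)."""
    xs, ys = [], []
    for label, mu in enumerate(means):
        xs.append(rng.normal(mu, noise, size=(n_per_class, len(mu))))
        ys.append(np.full(n_per_class, label))
    return np.concatenate(xs), np.concatenate(ys)

def train_and_score(train_x, train_y, test_x, test_y):
    """Fit a nearest-centroid classifier (a stand-in inference network)
    and return its Top-1 accuracy on the test set."""
    centroids = np.stack([train_x[train_y == c].mean(axis=0)
                          for c in np.unique(train_y)])
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=-1)
    return float((dists.argmin(axis=1) == test_y).mean())

true_means  = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]  # the "real" p(x|y)
model_means = [(2.0, 2.0), (4.0, 0.0), (0.0, 4.0)]  # a model that mislearns class 0

real_x, real_y = sample(true_means, 500)    # real training set
test_x, test_y = sample(true_means, 500)    # held-out real data
fake_x, fake_y = sample(model_means, 500)   # "samples" from the generative model

real_score = train_and_score(real_x, real_y, test_x, test_y)
cas        = train_and_score(fake_x, fake_y, test_x, test_y)
print(f"accuracy, trained on real data : {real_score:.3f}")
print(f"CAS, trained on model samples  : {cas:.3f}")
```

In this toy setting the single mislearned class drags CAS below the real-data baseline, in the same way the per-class breakdowns later attribute most of the gap to a handful of poorly modeled classes.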
The risk associated with a classifier \u02c6y = f(x) is:

E_{p(x,y)}[L(y, \u02c6y)] = E_{p(x)}[E_{p(y|x)}[L(y, \u02c6y) | X]]    (1)

As we only have samples from p(x,y), we measure the empirical risk (1/N) \u03a3_{i=1}^{N} L(y_i, f(x_i)). From the right-hand side of Equation 1, of the set of predictions Y, the optimal prediction \u02c6y minimizes the expected posterior loss:

\u02c6y = argmin_{y\u2032 \u2208 Y} E_{p(y|x)}[L(y, y\u2032) | X]    (2)

Assuming we know the label distribution p(y), a generative modeling approach to this problem is to model the conditional distribution p\u03b8(x|y), and infer labels using Bayes rule: p\u03b8(y|x) = p\u03b8(x|y)p(y)/p\u03b8(x). If p\u03b8(y|x) = p(y|x), then we can make predictions that minimize the risk for any loss function. If the risk is not minimized, however, then we can conclude that the distributions are not matched, and we can interrogate p\u03b8(y|x) to better understand how our generative models failed.

For most modern deep generative models, however, we have access to neither p\u03b8(x|y), the probability of the data given the label, nor p\u03b8(y|x), the model conditional distribution, nor p(y|x), the true conditional distribution. Instead, from samples x, y \u223c p(y)p\u03b8(x|y), we train a discriminative model \u02c6p(y|x) to learn p\u03b8(y|x), and use it to estimate the expected posterior loss E_{\u02c6p(y|x)}[L(y, \u02c6y) | X]. We define the generative risk as E_{p(x,y)}[L(y, \u02c6y_g)], where \u02c6y_g is the classifier that minimizes the expected posterior loss under \u02c6p(y|x). Then we compare the performance of this classifier to the performance of a classifier trained on samples from p(x,y). In the case of conditional generative models of images, y is the class label for image x, and the model of \u02c6p(y|x) is an image classifier. We use ResNets [37] in this work. The loss functions L we explore are the standard ones for image classification. One is 0-1, which yields Top-1 accuracy, and the other is 0-1 in the Top-5, which yields Top-5 accuracy.\u00b2 Procedurally, we train a classifier on synthetic data and evaluate the performance of the classifier on real data. We call the accuracy the Classification Accuracy Score (CAS).

Note that a CAS close to that for the data does not imply that the generative model accurately modeled the data distribution. This may happen for a few reasons. First, p\u03b8(y|x) = p(y|x) for any generative model that satisfies p\u03b8(x|y)/p\u03b8(x) = p(x|y)/p(x) for all x, y \u223c p(x,y). One example is a generative model that samples from the true distribution with probability p, and from a noise distribution with a support disjoint from the true distribution with probability 1\u2212p. In this case, our inference model is good but the underlying generative model is poor. Second, since the losses considered here are not proper scoring rules [38], one could obtain reasonable CAS from suboptimal inference networks. For example, suppose that p(y|x) = 1.0 for the correct class while \u02c6p(y|x) = 0.51 for the correct class due to poor synthetic data. CAS for both is 100%. Using a proper scoring rule, such as Brier score, eliminates this issue, but experimentally we found limited practical benefit from using one. Finally, a generative model that memorizes the training set will achieve the same CAS as the original data.\u00b3 In general, however, we hope that generative models produce samples disjoint from the set on which they are trained. If the samples are sufficiently different, we can train a classifier on both the original data and model data and expect improved accuracy. We denote the performance of classifiers trained on this \u201cnaive augmentation\u201d the Naive Augmentation Score (NAS). Our CAS results, however, indicate that the current models still significantly underfit, rendering the conclusions less compelling. For completeness, we include results on augmentation in Section 4.4. Despite these theoretical issues, we find that generative models have Classification Accuracy Scores lower than the original data, indicating that they fail to capture the data distribution.

3.1 Computation and Open-Sourcing Metric

Computationally, training classifiers is significantly more demanding than calculating FID or IS over 50,000 samples. We believe, however, that now is the right time for such a metric due to a few key advances in training classifiers: 1) the training of ImageNet classifiers has been reduced to minutes [39], 2) with cloud services, the variance due to implementation details of such a metric is largely mitigated, and 3) the price and time cost of training classifiers on cloud services is reasonable and will only improve over time. Moreover, many class-conditional generative models are computationally expensive to train, and thus even a relatively expensive metric such as CAS comprises a small percentage of the training budget. We open-source our metric on Google Cloud for others to use. The instructions are given in Appendix B. At the time of writing, one can compute the metric in 10 hours for roughly $15, or in 45 minutes for roughly $85 using TPUs. Moreover, depending on affiliation, one may be able to access TPUs for free using the TensorFlow Research Cloud (TFRC) (https://www.tensorflow.org/tfrc/).

\u00b2 It is more correct to state that the losses yield errors, but we present results as accuracies instead, as they are standard in the computer vision literature.
\u00b3 N.B. IS and FID also suffer the same failure mode.

Table 1: CAS for different models at 128\u00d7128 and 256\u00d7256 resolutions. BigGAN-deep samples are taken from the best truncation parameter of 1.5.

Training Set | Resolution | Top-5 Accuracy | Top-1 Accuracy | IS | FID-50K
Real | 128\u00d7128 | 88.79% | 68.82% | 165.38\u00b12.84 | 1.61
BigGAN-deep | 128\u00d7128 | 64.44% | 40.64% | 71.31\u00b11.57 | 4.22
HAM | 128\u00d7128 | 77.33% | 54.05% | 17.02\u00b10.79 | 46.05
Real | 256\u00d7256 | 91.47% | 73.09% | 331.83\u00b15.00 | 2.47
BigGAN-deep | 256\u00d7256 | 65.92% | 42.65% | 109.39\u00b11.56 | 11.78
VQ-VAE-2 | 256\u00d7256 | 77.59% | 54.83% | 43.44\u00b10.87 | 38.05

Figure 2: Comparison of per-class accuracy of data (blue) vs. model (red). Left: BigGAN-deep 256\u00d7256 at 1.5 truncation level. Middle: VQ-VAE-2 256\u00d7256. Right: HAM 128\u00d7128.

4 Experiments

Our experiments are simple: on ImageNet, we use three generative models\u2014BigGAN-deep at 256\u00d7256 and 128\u00d7128 resolutions, HAM with masked self-prediction auxiliary decoder at 128\u00d7128 resolution, and VQ-VAE-2 at 256\u00d7256 resolution\u2014to replace the ImageNet training set with a model-generated one, train an image classifier, and evaluate performance on the ImageNet validation set. To calculate CAS, we replace the ImageNet training set with one sampled from the model; each example from the original training set is replaced with a model sample from the same class. In addition, we compare CAS to two traditional GAN metrics, IS and FID, as these metrics are the current gold standard for GAN comparison and are being used to compare non-GAN to GAN models. Both rely on a feature space from a classifier trained on ImageNet, suggesting that if these metrics are useful at predicting performance on a downstream task, it would indeed be this one. Further details about the experiment can be found in Appendix A.1.

4.1 Model Comparison on ImageNet

Table 1 shows the performance of classifiers trained on model-generated datasets compared to those trained on the real dataset at 256\u00d7256 and 128\u00d7128, respectively. At 256\u00d7256 resolution, BigGAN-deep achieves a CAS Top-5 of 65.92%, suggesting that BigGANs are learning nontrivial distributions. Perhaps surprisingly, VQ-VAE-2, though performing quite poorly compared to BigGAN-deep on both FID and IS, obtains a CAS Top-5 accuracy of 77.59%. Both models, however, lag the original 256\u00d7256 dataset, which achieves a CAS Top-5 accuracy of 91.47%. We find nearly identical results for the 128\u00d7128 models. BigGAN-deep achieves CAS Top-5 and Top-1 similar to the 256\u00d7256 model (note that IS and FID results for 128\u00d7128 and 256\u00d7256 BigGAN-deep are vastly different). HAMs, similar to VQ-VAE-2, perform poorly on FID and Inception Score but outperform BigGAN-deep on CAS. Moreover, both models underperform relative to the original 128\u00d7128 dataset.

4.2 Uncovering Model Deficiencies

To better understand what accounts for the gap between generative model and dataset CAS, we broke down the performance by class (Figure 2). As shown in the left pane, nearly every class of BigGAN-deep suffers a drop in performance compared to the original dataset, though six classes\u2014partridge, red fox, jaguar/panther, squirrel monkey, African elephant, and strawberry\u2014show marginal improvement over the original dataset. The left pane of Figure 3 shows the two best and two worst performing categories, as measured by the difference in classification performance. Notably, for the two worst performing categories and two others\u2014balloon, paddle wheel, pencil sharpener, and spatula\u2014classification accuracy was 0% on the validation set. The per-class breakdown of VQ-VAE-2 (middle pane of Figure 2) shows that this model also underperforms the real data for most classes (only 31 classes perform better than the original data), though the gap is not as large as for BigGAN-deep. Furthermore, VQ-VAE-2 has better generalization performance in 87.6% of classes compared to BigGAN-deep, and suffers 0% classification accuracy for no classes. The middle pane of Figure 3 shows the two best and two worst performing categories. The right panes of Figures 2 and 3 show the per-class breakdown and top and bottom two classes, respectively, for HAMs. Results broadly mirror those of VQ-VAE-2.

Figure 3: The top two rows are samples from classes that achieved the best test set performance relative to the original dataset. The bottom two rows are those from classes that achieved the worst. Left: BigGAN-deep top two\u2014squirrel monkey and red fox\u2014and bottom two\u2014(hot air) balloon and paddle wheel. Middle: VQ-VAE-2 top two\u2014red fox and African elephant\u2014and bottom two\u2014guillotine and fur coat. Right: HAM top two\u2014husky and gong/tam-tam\u2014and bottom two\u2014hermit crab and harmonica.

Figure 4: Top: Originals. Bottom: Reconstructions using VQ-VAE-2.

4.3 A Note on FID and a Second Note on IS

We note that IS and FID have very little correlation with CAS, suggesting that alternative metrics are needed if we intend to deploy our models on downstream tasks. As a controlled experiment, we calculate CAS, IS, and FID for BigGAN-deep models with input noise distributions truncated at different values (known as the \u201ctruncation trick\u201d). As noted in [3], lower truncation values seem to improve sample quality at the expense of diversity. For CAS, the correlation coefficient between Top-1 accuracy and FID is 0.16, and between Top-1 accuracy and IS it is \u22120.86, the latter result incorrectly suggesting that improved IS is highly correlated with poorer performance. Moreover, the best-performing truncation values (1.5 and 2.0) have rather poor ISs and FIDs. That these poor ISs/FIDs also seem to indicate poor performance on this metric is no surprise; that other models, with well-performing ISs and FIDs, yield poor performance on CAS suggests that alternative metrics are needed. One can easily diagnose the issue with IS: as noted in [28], IS does not account for intra-class diversity, and a training set with little intra-class diversity may make the classifier fail to generalize to a more diverse test set.

Table 2: CAS for VQ-VAE-2 model reconstructions and BigGAN-deep models at different truncation levels at 256\u00d7256 resolution.

Training Set | Truncation | Top-5 Accuracy | Top-1 Accuracy | IS | FID-50K
BigGAN-deep | 0.20 | 13.24% | 5.11% | 339.06\u00b13.14 | 20.75
BigGAN-deep | 0.42 | 28.68% | 13.30% | 324.62\u00b13.29 | 15.93
BigGAN-deep | 0.50 | 32.88% | 15.66% | 316.31\u00b13.70 | 14.37
BigGAN-deep | 0.60 | 45.01% | 25.51% | 299.51\u00b13.20 | 12.41
BigGAN-deep | 0.80 | 56.68% | 32.88% | 258.72\u00b12.86 | 9.24
BigGAN-deep | 1.00 | 62.97% | 39.07% | 214.64\u00b12.01 | 7.42
BigGAN-deep | 1.50 | 65.92% | 42.65% | 109.39\u00b11.56 | 11.78
BigGAN-deep | 2.00 | 64.37% | 40.98% | 49.54\u00b10.98 | 28.67
VQ-VAE-2 reconstructions | \u2013 | 89.46% | 69.90% | 203.89\u00b12.55 | 8.69
Real | \u2013 | 91.47% | 73.09% | 331.83\u00b15.00 | 2.47

FID should better account for this lack of diversity, at least grossly, as the metric, calculated as FID(P_x, P_y) = \u2016\u00b5_x \u2212 \u00b5_y\u2016\u00b2 + tr(\u03a3_x + \u03a3_y \u2212 2(\u03a3_x\u03a3_y)^{1/2}), compares the covariance matrices of the data and model distributions. By comparison, CAS offers a finer measure of model performance, as it provides us a per-class metric to identify which classes have better or worse performance. While in theory one could calculate a per-class FID, FID is known to suffer from high bias [24] for a low number of samples, suggesting that per-class estimates would be unreliable.\u2074

Perhaps a larger issue is that IS and FID heavily penalize non-GAN models, suggesting that these heuristics are not suitable for inter-model-class comparisons. A particularly egregious failure is that IS and FID aggressively penalize certain types of samples that look nearly identical to the original dataset. For example, we computed CAS, IS, and FID on the ImageNet training set at 256\u00d7256 resolution and on reconstructions from VQ-VAE-2. As shown in Figure 4, the reconstructions look nearly identical to the original data. As noted in Table 2, however, IS decreases by 128 points and FID increases by 3.5\u00d7. The drop in performance is so great that BigGAN-deep at 1.00 truncation achieves better IS and FID than the nearly-identical reconstructions. CAS Top-1 and Top-5 for the reconstructions, however, drop by only 4.4% and 2.2%, respectively, relative to the original dataset. CAS for the BigGAN-deep model at 1.00 truncation, on the other hand, drops by 46.5% and 31.1%, respectively.

4.4 Naive Augmentation Score

To calculate NAS, we add to the original ImageNet training set 25%, 50%, or 100% more data from each of our models. The original ImageNet training set achieves a Top-5 accuracy of 92.97%. Although the CAS results for BigGAN-deep, and to a lesser extent VQ-VAE-2, suggest that augmenting the original training set with model samples will not result in improved classification performance, we wanted to study whether the relative ordering in the CAS experiments would hold for the NAS ones. Figure 5 illustrates the performance of the classifiers as we increase the amount of synthetic training data. Perhaps somewhat surprisingly, BigGAN-deep models that sample from lower truncation values, and thus have lower sample diversity, are able to perform better for data augmentation compared to those models that performed well on CAS. In fact, for some of the lowest truncation values, we found a modest improvement in classification performance: roughly 0.2%. Moreover, VQ-VAE-2 underperforms relative to the BigGAN-deep models. Of course, the caveat is that the former model does not yet have a mechanism to trade off sample quality for sample diversity.

The results on augmentation highlight different desiderata for samples that are added to the dataset rather than replacing it. Clearly, the samples added should be sufficiently different from the data to allow the classifier to better generalize, yet poorer sample quality may lead to poorer generalization compared to the original dataset. This may be the reason why extending the dataset with samples generated from a lower truncation value\u2014which are higher-quality, but less diverse\u2014performs better on NAS than on CAS. Furthermore, this may also explain why IS, FID, and CAS are not predictive of NAS.

\u2074 [24] proposed KID, an unbiased alternative to FID, but the variance of this metric is too large to be reliable when using the number of per-class samples in the ImageNet training set (roughly 1,000 per class), and is worse when using the 50 in the validation set. In addition to suffering high bias, per-class FID requires estimation of the real data covariance matrix of rank 2048 using far fewer samples, leading to rank deficiency.

Figure 5: Top-5 accuracy as training data is augmented by x% examples from BigGAN-deep for different truncation levels. Lower truncation generates datasets with less sample diversity.

Table 3: CAS for different models of CIFAR-10. PixelCNN-Bayes denotes classification accuracy using exact inference with the generative model.

Model | Real | BigGAN | cGAN | PixelCNN | PixelCNN-Bayes | PixelIQN
Accuracy | 92.58% | 71.87% | 76.35% | 64.02% | 60.05% | 74.26%

4.5 Model Comparison on CIFAR-10

Finally, we also compare CAS for different model classes on CIFAR-10. We compare four models: BigGAN, cGAN with Projection Discriminator [30], PixelCNN [19], and PixelIQN [40]. We train a ResNet-56 following the training procedure of [37]. More details can be found in Appendix A.2. Similar to the ImageNet experiments, we find that both GANs produce samples with a certain degree of generalization. GANs also significantly outperform PixelCNN on this benchmark. Furthermore, since PixelCNN is an exact likelihood model, we can compare classification performance with exact inference using Bayes rule to that with approximate inference using a classifier. Perhaps surprisingly, CAS for PixelCNN is modestly better than classification accuracy using exact inference, though both results are similar. Finally, PixelIQN has similar performance to the newer GANs.

5 Conclusion

Good metrics have long been an important, and perhaps underappreciated, component in driving improvements in models. They may be particularly important now, as generative models have reached a maturity at which they may be deployed in downstream tasks. We proposed one such metric, Classification Accuracy Score, for conditional generative models of images and found the metric practically useful in uncovering model deficiencies. Furthermore, we find that GAN models of ImageNet, despite high sample quality, tend to underperform models based on likelihood. Finally, we find that IS and FID unfairly penalize non-GAN models. An open question in this work is understanding to what extent these models generalize beyond the training set. While current results suggest that even state-of-the-art models currently underfit, recent progress indicates that underfitting may be a temporary issue. Measuring generalization will then be of primary importance, especially if models are deployed in downstream tasks.

Acknowledgments

We would like to thank Ali Razavi, Aaron van den Oord, Andy Brock, Jean-Baptiste Alayrac, Jeffrey De Fauw,
Sander Dieleman, Jeff Donahue, Karen Simonyan, Takeru Miyato, and Georg Ostrovski for discussion and help with models. Furthermore, we would like to thank those who contacted us, pointing us to prior work.

References

[1] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations, 2016. arXiv:1511.01844.
[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672\u20132680, 2014.
[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[4] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[5] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[6] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234\u20132242, 2016.
[7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626\u20136637, 2017.
[8] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. arXiv preprint arXiv:1812.01608, 2018.
[9] Jeffrey De Fauw, Sander Dieleman, and Karen Simonyan. Hierarchical autoregressive image models with auxiliary decoders. arXiv preprint arXiv:1903.04933, 2019.
[10] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. arXiv preprint arXiv:1906.00446, 2019.
[11] Stanley F. Chen, Douglas Beeferman, and Roni Rosenfeld. Evaluation metrics for language models. 1998.
[12] Jianwei Yang, Anitha Kannan, Dhruv Batra, and Devi Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. arXiv preprint arXiv:1703.01560, 2017.
[13] Shibani Santurkar, Ludwig Schmidt, and Aleksander M\u0105dry. A classification-based study of covariate shift in GAN distributions. arXiv preprint arXiv:1711.00970, 2017.
[14] Crist\u00f3bal Esteban, Stephanie L. Hyland, and Gunnar R\u00e4tsch. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633, 2017.
[15] Timoth\u00e9e Lesort, Jean-Fran\u00e7ois Goudou, and David Filliat. Training discriminative models to evaluate generative ones. arXiv preprint arXiv:1806.10840, 2018.
[16] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. How good is my GAN? In Proceedings of the European Conference on Computer Vision (ECCV), pages 213\u2013229, 2018.
[17] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[18] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[19] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790\u20134798, 2016.
[20] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215\u201310224, 2018.
[21] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[22] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.
[23] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don\u2019t know? arXiv preprint arXiv:1810.09136, 2018.
[24] Miko\u0142aj Bi\u0144kowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
[25] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep fea
turesasaperceptualmetric.InProceedingsoftheIEEEConferenceonComputerVisionandPatternRecognition,pages586\u2013595,2018.[26]MehdiSMSajjadi,OlivierBachem,MarioLucic,OlivierBousquet,andSylvainGelly.Assess-inggenerativemodelsviaprecisionandrecall.InAdvancesinNeuralInformationProcessingSystems,pages5228\u20135237,2018.[27]TuomasKynk\u00e4\u00e4nniemi,TeroKarras,SamuliLaine,JaakkoLehtinen,andTimoAila.Improvedprecisionandrecallmetricforassessinggenerativemodels.arXivpreprintarXiv:1904.06991,2019.[28]ShaneBarrattandRishiSharma.Anoteontheinceptionscore.arXivpreprintarXiv:1801.01973,2018.[29]SharonZhou,MitchellGordon,RanjayKrishna,AustinNarcomey,DurimMorina,andMichaelSBernstein.Hype:Humaneyeperceptualevaluationofgenerativemodels.arXivpreprintarXiv:1904.01121,2019.[30]TakeruMiyatoandMasanoriKoyama.cganswithprojectiondiscriminator.arXivpreprintarXiv:1802.05637,2018.[31]ValentinKhrulkovandIvanOseledets.Geometryscore:Amethodforcomparinggenerativeadversarialnetworks.arXivpreprintarXiv:1802.02664,2018.[32]SanjeevAroraandYiZhang.Dogansactuallylearnthedistribution?anempiricalstudy.arXivpreprintarXiv:1706.08224,2017.[33]ArthurGretton,KarstenMBorgwardt,MalteJRasch,BernhardSch\u00f6lkopf,andAlexanderSmola.Akerneltwo-sampletest.JournalofMachineLearningResearch,13(Mar):723\u2013773,2012.[34]StanislauSemeniuta,AliakseiSeveryn,andSylvainGelly.Onaccurateevaluationofgansforlanguagegeneration.arXivpreprintarXiv:1806.04936,2018.[35]ZhitingHu,ZichaoYang,XiaodanLiang,RuslanSalakhutdinov,andEricPXing.Towardcontrolledgenerationoftext.InProceedingsofthe34thInternationalConferenceonMachineLearning-Volume70,pages1587\u20131596.JMLR.org,2017.[36]RuishanLiu,NicoloFusi,andLesterMackey.Modelcompressionwithgenerativeadversarialnetworks.arXivpreprintarXiv:1812.02271,2018.11\f[37]KaimingHe,XRSSJZhang,SRen,andJSun.Deepresiduallearningforimagerecognition.eprint.arXivpreprintarXiv:0706.1234,2015.[38]TilmannGneitingandAdrianERaftery.Strictlyproperscoringrules,prediction,andestimation.JournaloftheAmericanStatistic
alAssociation,102(477):359\u2013378,2007.[39]PriyaGoyal,PiotrDoll\u00e1r,RossGirshick,PieterNoordhuis,LukaszWesolowski,AapoKyrola,AndrewTulloch,YangqingJia,andKaimingHe.Accurate,largeminibatchsgd:Trainingimagenetin1hour.arXivpreprintarXiv:1706.02677,2017.[40]GeorgOstrovski,WillDabney,andR\u00e9miMunos.Autoregressivequantilenetworksforgenera-tivemodeling.arXivpreprintarXiv:1806.05575,2018.12\f", "award": [], "sourceid": 6641, "authors": [{"given_name": "Suman", "family_name": "Ravuri", "institution": "DeepMind"}, {"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google DeepMind"}]}