{"title": "Assessing Generative Models via Precision and Recall", "book": "Advances in Neural Information Processing Systems", "page_first": 5228, "page_last": 5237, "abstract": "Recent advances in generative modeling have led to an increased interest in the study of statistical divergences as means of model comparison. Commonly used evaluation methods, such as the Frechet Inception Distance (FID), correlate well with the perceived quality of samples and are sensitive to mode dropping. However, these metrics are unable to distinguish between different failure cases since they only yield one-dimensional scores. We propose a novel definition of precision and recall for distributions which disentangles the divergence into two separate dimensions. The proposed notion is intuitive, retains desirable properties, and naturally leads to an efficient algorithm that can be used to evaluate generative models. We relate this notion to total variation as well as to recent evaluation metrics such as Inception Score and FID. To demonstrate the practical utility of the proposed approach we perform an empirical study on several variants of Generative Adversarial Networks and Variational Autoencoders. 
In an extensive set of experiments we show that the proposed metric is able to disentangle the quality of generated samples from the coverage of the target distribution.", "full_text": "Assessing Generative Models via Precision and Recall

Mehdi S. M. Sajjadi* (MPI for Intelligent Systems, Max Planck ETH Center for Learning Systems), Olivier Bachem (Google Brain), Mario Lucic (Google Brain), Olivier Bousquet (Google Brain), Sylvain Gelly (Google Brain)

Abstract

Recent advances in generative modeling have led to an increased interest in the study of statistical divergences as means of model comparison. Commonly used evaluation methods, such as the Fréchet Inception Distance (FID), correlate well with the perceived quality of samples and are sensitive to mode dropping. However, these metrics are unable to distinguish between different failure cases since they only yield one-dimensional scores. We propose a novel definition of precision and recall for distributions which disentangles the divergence into two separate dimensions. The proposed notion is intuitive, retains desirable properties, and naturally leads to an efficient algorithm that can be used to evaluate generative models. We relate this notion to total variation as well as to recent evaluation metrics such as Inception Score and FID. To demonstrate the practical utility of the proposed approach we perform an empirical study on several variants of Generative Adversarial Networks and Variational Autoencoders. In an extensive set of experiments we show that the proposed metric is able to disentangle the quality of generated samples from the coverage of the target distribution.

1 Introduction

Deep generative models, such as Variational Autoencoders (VAE) [12] and Generative Adversarial Networks (GAN) [8], have received a great deal of attention due to their ability to learn complex, high-dimensional distributions. One of the biggest impediments to future research is the lack of quantitative evaluation methods to accurately assess the quality of trained models. Without a proper evaluation metric researchers often need to visually inspect generated samples or resort to qualitative techniques which can be subjective. One of the main difficulties for quantitative assessment lies in the fact that the distribution
is only specified implicitly – one can learn to sample from a predefined distribution, but cannot evaluate the likelihood efficiently. In fact, even if likelihood computation were computationally tractable, it might be inadequate and misleading for high-dimensional problems [22]. As a result, surrogate metrics are often used to assess the quality of the trained models. Some proposed measures, such as Inception Score (IS) [20] and Fréchet Inception Distance (FID) [9], have shown promising results in practice. In particular, FID has been shown to be robust to image corruption, it correlates well with the visual fidelity of the samples, and it can be computed on unlabeled data. However, all of the metrics commonly applied to evaluating generative models share a crucial weakness: since they yield a one-dimensional score, they are unable to distinguish between different failure cases. For example, the generative models shown in Figure 1 obtain similar FIDs but exhibit different sample characteristics: the model on the left trained on MNIST [15] produces realistic samples, but only generates a subset of the digits. On the other hand, the model on the right produces low-quality samples which appear to cover all digits. A similar effect can be observed on the CelebA [16] dataset.

* This work was done during an internship at Google Brain. Correspondence: msajjadi.com, bachem@google.com, lucic@google.com.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Comparison of GANs trained on MNIST and CelebA. Although the models obtain a similar FID on each dataset (32/29 for MNIST and 65/62 for CelebA), their samples look very different. For example, the model on the left produces reasonable-looking faces on CelebA, but too many dark images. In contrast, the model on the right produces more artifacts, but more varied images. By the proposed metric (middle), the models on the left achieve higher precision and lower recall than the models on the right, which suffices to successfully distinguish between the failure cases.

In this work we argue that a single-value summary is not adequate to compare generative models. Motivated by this shortcoming, we present a novel approach which disentangles the divergence between distributions into two components: precision and recall. Given a reference distribution P and a learned distribution Q, precision intuitively measures the quality of samples from Q, while recall measures the proportion of P that is covered by Q. Furthermore, we propose an elegant algorithm which can compute these quantities based on samples from P and Q. In particular, using this approach we are able to quantify the degree of mode dropping and mode inventing based on samples from the true and the learned distributions.

Our contributions: (1) We introduce a novel definition of precision and recall for distributions and prove that the notion is theoretically sound and has desirable properties, (2) we propose an efficient algorithm to compute these quantities, (3) we relate these notions to total variation, IS and FID, (4) we demonstrate that in practice one can quantify the degree of mode dropping and mode inventing on real-world datasets (image and text data), and (5) we compare several types of generative models based on the proposed approach – to our knowledge, this is the first metric that experimentally confirms the folklore that GANs often produce \"sharper\" images, but can suffer from mode collapse (high precision, low recall), while VAEs produce \"blurry\" images, but cover more modes of the distribution (low precision, high recall).

2 Background and Related Work

The task of evaluating generative models is an active research area. Here we focus on recent work in the context of deep generative models for image and text data. Classic approaches relying on comparing log-likelihood have received some criticism due to the fact that one can achieve high likelihood, but low image quality, and conversely, high-quality images but low likelihood [22]. While the likelihood can be approximated in some settings, kernel density estimation in high-dimensional spaces is extremely challenging [22, 24]. Other failure modes related to density estimation in high-dimensional spaces have been elaborated in [10, 22]. A recent review of popular approaches is presented in [5].

The Inception Score (IS) [20] offers a way to quantitatively evaluate the quality of generated samples in the context of image data. Intuitively, the conditional label distribution p(y|x) of samples containing meaningful objects should have low entropy, while the label distribution over the whole dataset p(y) should have high entropy. Formally, IS(G) = exp(E_{x∼G}[d_KL(p(y|x), p(y))]). The score is computed based on a classifier (Inception network trained on ImageNet). IS necessitates a labeled dataset and has been found to be weak at providing guidance for model comparison [3].

Figure 2: Intuitive examples of P and Q.
Figure 3: PRD(Q, P) for the examples above.
Figure 4: Illustration of the algorithm.

The FID [9] provides an alternative approach which requires no labeled data. The samples are first embedded in some feature space (e.g., a specific layer of the Inception network for images). Then, a continuous multivariate Gaussian is fit to the data and the distance is computed as FID(x, g) = ||μ_x − μ_g||₂² + Tr(Σ_x + Σ_g − 2(Σ_x Σ_g)^(1/2)), where μ and Σ denote the mean and covariance of the corresponding samples. FID is sensitive both to the addition of spurious modes and to mode dropping (see Figure 5 and results in [18]). [4] recently introduced an unbiased alternative to FID, the Kernel Inception Distance. While unbiased, it shares an extremely high Spearman rank-order correlation with FID [14].

Another approach is to train a classifier between the real and fake distributions and to use its accuracy on a test set as a proxy for the quality of the samples [11, 17]. This approach necessitates training a classifier for each model, which is seldom practical. Furthermore, the classifier might detect a single dimension where the true and generated samples differ (e.g., barely visible artifacts in generated images) and enjoy high accuracy, which runs the risk of assigning lower quality to a better model.

To the best of our knowledge, all commonly used metrics for evaluating generative models are one-dimensional in that they only yield a single score or distance. A notion of precision and recall has previously been introduced in [18] where the authors compute the distance to the manifold of the true data and use it as a proxy for precision and recall on a synthetic dataset. Unfortunately, it is not possible to compute this quantity for more complex datasets.

3 PRD: Precision and Recall for Distributions

In this section, we derive a novel notion of precision and recall to compare a distribution Q to a reference distribution P. The key intuition is that precision should measure how much of Q can be generated by a “part” of P while recall should measure how much of P can be generated by a “part” of Q. Figure 2 (a)-(d) shows four toy examples for P and Q to visualize this idea: (a) If P is bimodal and Q only captures one of the modes, we should have perfect precision but only limited recall. (b) In the opposite case, we should have perfect recall but only limited precision. (c) If Q = P, we should have perfect precision and recall. (d) If the supports of P and Q are disjoint, we should have zero precision and recall.

3.1 Derivation

Let S = supp(P) ∩ supp(Q) be the (non-empty) intersection of the supports² of P and Q. Then, P may be viewed as a two-component mixture where the first component P_S is a probability distribution on S and the second component P_S̄ is defined on the complement of S. Similarly, Q may be rewritten as a mixture of Q_S and Q_S̄. More formally, for some ᾱ, β̄ ∈ (0, 1], we define

P = β̄ P_S + (1 − β̄) P_S̄  and  Q = ᾱ Q_S + (1 − ᾱ) Q_S̄.   (1)

This decomposition allows for a natural interpretation: P_S̄ is the part of P that cannot be generated by Q, so its mixture weight 1 − β̄ may be viewed as a loss in recall. Similarly, Q_S̄ is the part of Q that cannot be generated by P, so 1 − ᾱ may be regarded as a loss in precision. In the case where P_S = Q_S, i.e., the distributions P and Q agree on S up to scaling, ᾱ and β̄ provide us with a simple two-number precision and recall summary satisfying the examples in Figure 2 (a)-(d).

² For a distribution P defined on a finite state space Ω, we define supp(P) = {ω ∈ Ω | P(ω) > 0}.

If P_S ≠ Q_S, we are faced with a conundrum: should the differences in P_S and Q_S be attributed to losses in precision or recall? Is Q_S inadequately “covering” P_S or is it generating “unnecessary” noise? Inspired by PR curves for binary classification, we propose to resolve this predicament by providing a trade-off between precision and recall instead of a two-number summary for any two distributions P and Q. To parametrize this trade-off, we consider a distribution μ on S that signifies a “true” common component of P_S and Q_S and, similarly to (1), we decompose both P_S and Q_S as

P_S = β′ μ + (1 − β′) P_μ  and  Q_S = α′ μ + (1 − α′) Q_μ.   (2)

The distribution P_S is viewed as a two-component mixture where the first component is μ and the second component P_μ signifies the part of P_S that is “missed” by Q_S and should thus be considered a recall loss. Similarly, Q_S is decomposed into μ and the part Q_μ that signifies noise and should thus be considered a precision loss. As μ is varied, this leads to a trade-off between precision and recall. It should be noted that unlike PR curves for binary classification where different thresholds lead to different classifiers, trade-offs between precision and recall here do not constitute different models or distributions – the proposed PRD curves only serve as a description of the characteristics of the model with respect to the target distribution.

3.2 Formal definition

For simplicity, we consider distributions P and Q that are defined on a finite state space, though the notion of precision and recall can be extended to arbitrary distributions. By combining (1) and (2), we obtain the following formal definition of precision and recall.

Definition 1. For α, β ∈ (0, 1], the probability distribution Q has precision α at recall β w.r.t. P if there exist distributions μ, ν_P and ν_Q such that

P = β μ + (1 − β) ν_P  and  Q = α μ + (1 − α) ν_Q.   (3)

The component ν_P denotes the part of P that is “missed” by Q and encompasses both P_S̄ in (1) and P_μ in (2). Similarly, ν_Q denotes the noise part of Q and includes both Q_S̄ in (1) and Q_μ in (2).

Definition 2. The set of attainable pairs of precision and recall of a distribution Q w.r.t. a distribution P is denoted by PRD(Q, P) and it consists of all (α, β) satisfying Definition 1 and the pair (0, 0).

The set PRD(Q, P) characterizes the above-mentioned trade-off between precision and recall and can be visualized similarly to PR curves in binary classification: Figure 3 (a)-(d) shows the set PRD(Q, P) on a 2D plot for the examples (a)-(d) in Figure 2. Note how the plot distinguishes between (a) and (b): any symmetric evaluation method (such as FID) assigns these cases the same score although they are highly different. The interpretation of the set PRD(Q, P) is further aided by the following set of basic properties which we prove in Section A.1 in the appendix.

Theorem 1. Let P and Q be probability distributions defined on a finite state space Ω. The set PRD(Q, P) satisfies the following properties:
(i) (1, 1) ∈ PRD(Q, P) ⇔ Q = P (equality)
(ii) PRD(Q, P) = {(0, 0)} ⇔ supp(Q) ∩ supp(P) = ∅ (disjoint supports)
(iii) Q(supp(P)) = ᾱ = max_{(α,β) ∈ PRD(Q,P)} α (max precision)
(iv) P(supp(Q)) = β̄ = max_{(α,β) ∈ PRD(Q,P)} β (max recall)
(v) (α′, β′) ∈ PRD(Q, P) if α′ ∈ (0, α], β′ ∈ (0, β] for some (α, β) ∈ PRD(Q, P) (monotonicity)
(vi) (α, β) ∈ PRD(Q, P) ⇔ (β, α) ∈ PRD(P, Q) (duality)

Property (i) in combination with Property (v) guarantees that Q = P if the set PRD(Q, P) contains the interior of the unit square, see case (c) in Figures 2 and 3. Similarly, Property (ii) assures that whenever there is no overlap between P and Q, PRD(Q, P) only contains the origin, see case (d) of Figures 2 and 3. Properties (iii) and (iv) provide a connection to the decomposition in (1) and allow an analysis of the cases (a) and (b) in Figures 2 and 3: as expected, Q in (a) achieves a maximum precision of 1 but only a maximum recall of 0.5 while in (b), maximum recall is 1 but maximum precision is 0.5. Note that the quantities ᾱ and β̄ here are by construction the same as in (1). Finally, Property (vi) provides a natural interpretation of precision and recall: the precision of Q w.r.t. P is equal to the recall of P w.r.t. Q and vice versa.

Clearly, not all cases are as simple as the examples (a)-(d) in Figures 2 and 3, in particular if P and Q are different on the intersection S of their support. The examples (e) and (f) in Figure 2 and the resulting sets PRD(Q, P) in Figure 3 illustrate the importance of the trade-off between precision and recall as well as the utility of the set PRD(Q, P). In both cases, P and Q have the same support while Q has high precision and low recall in case (e) and low precision and high recall in case (f). This is clearly captured by the sets PRD(Q, P). Intuitively, the examples (e) and (f) may be viewed as noisy versions of the cases (a) and (b) in Figure 2.

3.3 Algorithm

Computing the set PRD(Q, P) based on Definitions 1 and 2 is non-trivial as one has to check whether there exist suitable distributions μ, ν_P and ν_Q for all possible values of α and β. We introduce an equivalent definition of PRD(Q, P) in Theorem 2 that does not depend on the distributions μ, ν_P and ν_Q and that leads to an elegant algorithm to compute practical PRD curves.

Theorem 2. Let P and Q be two probability distributions defined on a finite state space Ω. For λ > 0 define the functions

α(λ) = Σ_{ω∈Ω} min(λ P(ω), Q(ω))  and  β(λ) = Σ_{ω∈Ω} min(P(ω), Q(ω)/λ).   (4)

Then, it holds that PRD(Q, P) = {(θ α(λ), θ β(λ)) | λ ∈ (0, ∞), θ ∈ [0, 1]}.

We prove the theorem in Section A.2 in the appendix. The key idea of Theorem 2 is illustrated in Figure 4: the set PRD(Q, P) may be viewed as a union of segments of the lines α = λβ over all λ ∈ (0, ∞). Each segment starts at the origin (0, 0) and ends at the maximal achievable value (α(λ), β(λ)). This provides a surprisingly simple algorithm to compute PRD(Q, P) in practice: simply compute pairs of α(λ) and β(λ) as defined in (4) for an equiangular grid of values of λ. For a given angular resolution m ∈ ℕ, we compute

PRD̂(Q, P) = {(α(λ), β(λ)) | λ ∈ Λ}  where  Λ = {tan(i/(m+1) · π/2) | i = 1, 2, ..., m}.

To compare different distributions Q_i, one may simply plot their respective PRD curves PRD̂(Q_i, P), while an approximation of the full sets PRD(Q_i, P) may be computed by interpolation between PRD̂(Q_i, P) and the origin. An implementation of the algorithm is available at https://github.com/msmsajjadi/precision-recall-distributions.

3.4 Connection to total variation distance

Theorem 2 provides a natural interpretation of the proposed approach. For λ = 1, we have

α(1) = β(1) = Σ_{ω∈Ω} min(P(ω), Q(ω)) = Σ_{ω∈Ω} [P(ω) − (P(ω) − Q(ω))⁺] = 1 − δ(P, Q)

where δ(P, Q) denotes the total variation distance between P and Q. As such, our notion of precision and recall may be viewed as a generalization of total variation distance.

4 Application to Deep Generative Models

In this section, we show that the algorithm introduced in Section 3.3 can be readily applied to evaluate precision and recall of deep generative models. In practice, access to P and Q is given via samples P̂ ∼ P and Q̂ ∼ Q. Given that both P and Q are continuous distributions, the probability of generating a point sampled from Q is 0. Furthermore, there is strong empirical evidence that comparing samples in image space runs the risk of assigning higher quality to a worse model [17, 20, 22]. A common remedy is to apply a pre-trained classifier trained on natural images and to compare P̂ and Q̂ at a feature level. Intuitively, in this feature space the samples should be compared based on statistical regularities in the images rather than random artifacts resulting from the generative process [17, 19].

Following this line of work, we first use a pre-trained Inception network to embed the samples (i.e., using the Pool3 layer [9]). We then cluster the union of P̂ and Q̂ in this feature space using mini-batch k-means with k = 20 [21]. Intuitively, we reduce the problem to a one-dimensional problem where the histogram over the cluster assignments can be meaningfully compared. Hence, failing to produce samples from a cluster with many samples from the true distribution will hurt recall, and producing samples in clusters without many real samples will hurt precision. As the clustering algorithm is randomized, we run the procedure several times and average over the PRD curves. We note that such a clustering is meaningful as shown in Figure 9 in the appendix and that it can be efficiently scaled to very large sample sizes [1, 2]. We stress that from the point of view of the proposed algorithm, only a meaningful embedding is required. As such, the algorithm can be applied to various data modalities. In particular, we show in Section 4.1 that besides image data the algorithm can be applied to a text generation task.

4.1 Adding and dropping modes from the target distribution

Mode collapse or mode dropping is a major challenge in GANs [8, 20]. Due to the symmetry of commonly used metrics with respect to precision and recall, the only way to assess whether the model is producing low-quality images or dropping modes is by visual inspection. In stark contrast, the proposed metric can quantitatively disentangle these effects, which we empirically demonstrate. We consider three datasets commonly used in the GAN literature: MNIST [15], Fashion-MNIST [25], and CIFAR-10 [13]. These datasets are labeled and consist of 10 balanced classes. To show the sensitivity of the proposed measure to mode dropping and mode inventing, we first fix P̂ to contain samples from the first 5 classes in the respective test set. Then, for a fixed i = 1, ..., 10, we generate a set Q̂_i which consists of samples from the first i classes from the training set. As i increases, Q̂_i covers an increasing number of classes from P̂ which should result in higher recall. As we increase i beyond 5, Q̂_i includes samples from an increasing number of classes that are not present in P̂, which should result in a loss in precision, but not in recall as the other classes are already covered. Finally, the set Q̂_5 covers the same classes as P̂, so it should have high precision and high recall.

Figure 5: Left: IS and FID as we remove and add classes of CIFAR-10. IS generally only increases, while FID is sensitive to both the addition and removal of classes. However, it cannot distinguish between the two failure cases of inventing or dropping modes. Middle: Resulting PRD curves for the same experiment. As expected, adding modes leads to a loss in precision (Q6–Q10), while dropping modes leads to a loss in recall (Q1–Q4). As an example consider Q4 and Q6 which have similar FID, but strikingly different PRD curves. The same behavior can be observed for the task of text generation, as displayed on the plot on the right. For this experiment, we set P to contain samples from all classes so the PRD curves demonstrate the increase in recall as we increase the number of classes in Q.

Figure 5 (left) shows the IS and FID for the CIFAR-10 dataset (results on the other datasets are shown in Figure 11 in the appendix). Since the IS is not computed w.r.t. a reference distribution, it is invariant to the choice of P̂, so as we add classes to Q̂_i, the IS increases. The FID decreases as we add more classes until Q̂_5 before it starts to increase as we add spurious modes. Critically, FID fails to distinguish the cases of mode dropping and mode inventing: Q̂_4 and Q̂_6 share similar FIDs. In contrast, Figure 5 (middle) shows our PRD curves as we vary the number of classes in Q̂_i. Adding correct modes leads to an increase in recall, while adding fake modes leads to a loss of precision. Finally, we note that the proposed PRD algorithm does not require labeled data, as opposed to the IS which further needs a classifier that was trained on the respective dataset.

We also apply the proposed approach on text data as shown in Figure 5 (right). In particular, we use the MultiNLI corpus of crowd-sourced sentence pairs annotated with topic and textual entailment information [23]. After discarding the entailment label, we collect all unique sentences for the same topic. Following [6], we embed these sentences using a BiLSTM with 2048 cells in each direction and max pooling, leading to a 4096-dimensional embedding [7]. We consider 5 classes from this dataset and fix P̂ to contain samples from all classes to measure the loss in recall for different Q̂_i. The curves in Figure 5 (right) successfully demonstrate the sensitivity of recall to mode dropping.

4.2 Assessing class imbalances for GANs

In this section we analyze the effect of class imbalance on the PRD curves.

Figure 6: Comparing two GANs trained on MNIST which both achieve an FID of 49. The model on the left seems to produce high-quality samples of only a subset of digits. On the other hand, the model on the right generates low-quality samples of all digits. The histograms showing the corresponding class distributions based on a trained MNIST classifier confirm this observation. At the same time, the classifier is more confident, which indicates different levels of precision (96.7% for the model on the left compared to 88.6% for the model on the right).

Figure 6 shows a pair of GANs trained on MNIST which have virtually the same FID, but very different PRD curves. The model on the left generates a subset of the digits of high quality, while the model on the right seems to generate all digits, but each has low quality. We can naturally interpret this difference via the PRD curves: for a desired recall level of less than ~0.6, the model on the left enjoys higher precision – it generates several digits of high quality. If, however, one desires a recall higher than ~0.6, the model on the right enjoys higher precision as it covers all digits. To confirm this, we train an MNIST classifier on the embedding of P̂ with the ground-truth labels and plot the distribution of the predicted classes for both models. The histograms clearly show that the model on the left failed to generate all classes (loss in recall), while the model on the right is producing a more balanced distribution over all classes (high recall). At the same time, the classifier has an average confidence³ of 96.7% on the model on the left compared to 88.6% on the model on the right, indicating that the sample quality of the former is higher. This aligns very well with the PRD plots: samples on the left have high quality but are not diverse, in contrast to the samples on the right which are diverse but have low quality.

This analysis reveals a connection to IS which is based on the premise that the conditional label distribution p(y|x) should have low entropy, while the marginal p(y) = ∫ p(y|x = G(z)) dz should have high entropy. To further analyze this relationship, we plot p(y|x) against precision and p(y) against recall in Figure 10 in the appendix. The results over a large number of GANs and VAEs show a large Spearman correlation of −0.83 for precision and 0.89 for recall. We however stress two key differences between the approaches: firstly, to compute the quantities in IS one needs a classifier and a labeled dataset, in contrast to the proposed PRD metric which can be applied on unlabeled data. Secondly, IS only captures losses in recall w.r.t. classes, while our metric measures more fine-grained recall losses (see Figure 8 in the appendix).
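Once samples are embedded and binned into cluster histograms as described above, the PRD computation reduces to evaluating α(λ) and β(λ) from Theorem 2 on an equiangular grid of λ. The following is a minimal sketch of that step (an illustration under these assumptions, not the authors' reference implementation, which is available at the GitHub URL above; the function name and grid resolution are chosen for exposition):

```python
import numpy as np

def prd_curve(p_counts, q_counts, num_angles=1001, epsilon=1e-10):
    # Sketch of the PRD algorithm (Theorem 2): given histograms over the same
    # bins (e.g., cluster assignments of embedded samples), evaluate
    # alpha(lambda) and beta(lambda) for slopes lambda = tan(angle) on an
    # equiangular grid of angles in (0, pi/2).
    p = np.asarray(p_counts, dtype=np.float64)
    q = np.asarray(q_counts, dtype=np.float64)
    p = p / p.sum()  # normalized reference histogram P
    q = q / q.sum()  # normalized model histogram Q
    angles = np.linspace(epsilon, np.pi / 2 - epsilon, num_angles)
    lambdas = np.tan(angles)
    # alpha(lam) = sum_w min(lam * P(w), Q(w)); beta(lam) = sum_w min(P(w), Q(w)/lam)
    precision = np.array([np.minimum(lam * p, q).sum() for lam in lambdas])
    recall = np.array([np.minimum(p, q / lam).sum() for lam in lambdas])
    return precision, recall
```

For identical histograms the curve reaches precision and recall 1, for disjoint supports it collapses to the origin, and when Q covers only half of P the maximal recall is 0.5, matching Theorem 1 (i)-(iv).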
³ We denote the output of the classifier for its highest value at the softmax layer as confidence. The intuition is that higher values signify higher confidence of the model for the given label.

Figure 7: F1/8 vs F8 scores for a large number of GANs and VAEs on the Fashion-MNIST dataset. For each model, we plot the maximum F1/8 and F8 scores to show the trade-off between precision and recall. VAEs generally achieve lower precision and/or higher recall than GANs, which matches the folklore that VAEs often produce samples of lower quality while being less prone to mode collapse. On the right we show samples from four models which correspond to various success/failure modes: (A) high precision, low recall, (B) high precision, high recall, (C) low precision, low recall, and (D) low precision, high recall.

4.3 Application to GANs and VAEs

We evaluate the precision and recall of 7 GAN types and the VAE with 100 hyperparameter settings each as provided by [18]. In order to visualize this vast quantity of models, one needs to summarize the PRD curves. A natural idea is to compute the maximum F1 score, which corresponds to the harmonic mean between precision and recall, as a single-number summary. This idea is fundamentally flawed as F1 is symmetric. However, its generalization, defined as

F_β = (1 + β²) · p·r / (β²·p + r),

provides a way to quantify the relative importance of precision and recall: β > 1 weighs recall higher than precision, whereas β < 1 weighs precision higher than recall. As a result, we propose to distill each PRD curve into a pair of values: F_β and F_{1/β}.

Figure 7 compares the maximum F8 with the maximum F1/8 for these models on the Fashion-MNIST dataset. We choose β = 8 as it offers a good insight into the bias towards precision versus recall. Since F8 weighs recall higher than precision and F1/8 does the opposite, models with higher recall than precision will lie below the diagonal F8 = F1/8 and models with higher precision than recall will lie above. To our knowledge, this is the first metric which confirms the folklore that VAEs are biased towards higher recall, but may suffer from precision issues (e.g., due to blurring effects), at least on this dataset. On the right, we show samples from four models on the extreme ends of the plot for all combinations of high and low precision and recall. We have included similar plots on the MNIST, CIFAR-10 and CelebA datasets in the appendix.

5 Conclusion

Quantitatively evaluating generative models is a challenging task of paramount importance. In this work we show that one-dimensional scores are not sufficient to capture different failure cases of current state-of-the-art generative models. As an alternative, we propose a novel notion of precision and recall for distributions and prove that both notions are theoretically sound and have desirable properties. We then connect these notions to total variation distance as well as FID and IS, and we develop an efficient algorithm that can be readily applied to evaluate deep generative models based on samples. We investigate the properties of the proposed algorithm on real-world datasets, including image and text generation, and show that it captures the precision and recall of generative models. Finally, we find empirical evidence supporting the folklore that VAEs produce samples of lower quality, while being less prone to mode collapse than GANs.

References
[1] Olivier Bachem, Mario Lucic, Hamed Hassani, and Andreas Krause. Fast and provably good seedings for k-means. In Advances in Neural Information Processing Systems (NIPS), 2016.
[2] Olivier Bachem, Mario Lucic, S Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In AAAI, 2016.
[3] Shane Barratt and Rishi Sharma. A Note on the Inception Score. arXiv preprint arXiv:1801.01973, 2018.
[4] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.
[5] Ali Borji. Pros and Cons of GAN Evaluation Measures. arXiv preprint arXiv:1802.03446, 2018.
[6] Ondřej Cífka, Aliaksei Severyn, Enrique Alfonseca, and Katja Filippova. Eval all, trust a few, do wrong to none: Comparing sentence generation models. arXiv preprint arXiv:1804.07972, 2018.
[7] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. arXiv preprint arXiv:1705.02364, 2017.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In Advances in Neural Information Processing Systems (NIPS), 2017.
[10] Ferenc Huszár. How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary? arXiv preprint arXiv:1511.05101, 2015.
[11] Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. Quantitatively evaluating GANs with divergences proposed for training. In International Conference on Learning Representations (ICLR), 2018.
[12] Diederik P Kingma and Max Welling. Auto-encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
[13] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
[14] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The GAN Landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.
[15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In IEEE, 1998.
[16] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
[17] David Lopez-Paz and Maxime Oquab. Revisiting Classifier Two-Sample Tests. In International Conference on Learning Representations (ICLR), 2016.
[18] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs Created Equal? A Large-Scale Study. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[19] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[20] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems (NIPS), 2016.
[21] David Sculley. Web-scale k-means clustering. In International Conference on World Wide Web (WWW), 2010.
[22] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR), 2016.
[23] Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.
[24] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoder-based generative models. In International Conference on Learning Representations (ICLR), 2017.
[25] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.", "award": [], "sourceid": 2500, "authors": [{"given_name": "Mehdi S. M.", "family_name": "Sajjadi", "institution": "Max Planck Institute for Intelligent Systems and ETH Center for Learning Systems"}, {"given_name": "Olivier", "family_name": "Bachem", "institution": "Google AI (Brain team)"}, {"given_name": "Mario", "family_name": "Lucic", "institution": "Google Brain"}, {"given_name": "Olivier", "family_name": "Bousquet", "institution": "Google Brain (Zurich)"}, {"given_name": "Sylvain", "family_name": "Gelly", "institution": "Google Brain (Zurich)"}]}