{"title": "Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 5859, "page_last": 5870, "abstract": "Applications of optimal transport have recently gained remarkable attention as a result of the computational advantages of entropic regularization. However, in most situations the Sinkhorn approximation to the Wasserstein distance is replaced by a regularized version that is less accurate but easy to differentiate. In this work we characterize the differential properties of the original Sinkhorn approximation, proving that it enjoys the same smoothness as its regularized version, and we explicitly provide an efficient algorithm to compute its gradient. We show that this result benefits both theory and applications: on the one hand, high-order smoothness confers statistical guarantees to learning with Wasserstein approximations. On the other hand, the gradient formula makes it possible to efficiently solve learning and optimization problems in practice. 
Promising preliminary experiments complement our analysis.", "full_text": "Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance\n\nGiulia Luise^1, Alessandro Rudi^2, Massimiliano Pontil^{1,3}, Carlo Ciliberto^{1,4}\n{g.luise.16,m.pontil}@ucl.ac.uk, alessandro.rudi@inria.fr, c.ciliberto@imperial.ac.uk\n1 Department of Computer Science, University College London, London, UK.\n2 INRIA - D\u00e9partement d\u2019informatique, \u00c9cole Normale Sup\u00e9rieure - PSL Research University, Paris, France.\n3 Istituto Italiano di Tecnologia, Genova, Italy.\n4 Department of Electrical and Electronic Engineering, Imperial College, London, UK.\n\nAbstract\n\nApplications of optimal transport have recently gained remarkable attention as a result of the computational advantages of entropic regularization. However, in most situations the Sinkhorn approximation to the Wasserstein distance is replaced by a regularized version that is less accurate but easy to differentiate. In this work we characterize the differential properties of the original Sinkhorn approximation, proving that it enjoys the same smoothness as its regularized version, and we explicitly provide an efficient algorithm to compute its gradient. We show that this result benefits both theory and applications: on the one hand, high-order smoothness confers statistical guarantees to learning with Wasserstein approximations. On the other hand, the gradient formula is used to efficiently solve learning and optimization problems in practice. Promising preliminary experiments complement our analysis.\n\n1 Introduction\n\nApplications of optimal transport have been gaining increasing attention in machine learning. This success is mainly due to the recent introduction of the Sinkhorn distance [1,2], which offers an efficient alternative to the heavy cost of evaluating the Wasserstein distance directly. The computational advantages have motivated recent applications in optimization and learning over the space of probability distributions, where the Wasserstein distance is a natural metric. However, in these settings adopting the Sinkhorn approximation requires solving a further optimization problem with respect to the corresponding approximation function rather than only evaluating it at a poin
t. This amounts to a bi-level problem [3] for which it is challenging to derive an optimization approach [4]. As a consequence, a regularized version of the Sinkhorn approximation is usually considered [5,6,7,8,9], for which it is possible to efficiently compute a gradient and thus employ it in first-order optimization methods [8]. More recently, efficient automatic differentiation strategies have also been proposed [10], with applications ranging from dictionary learning [11] to GANs [4] and discriminant analysis [12]. A natural question is whether the easier tractability of this regularization is paid for in terms of accuracy. Indeed, while as a direct consequence of [13] it can be shown that the original Sinkhorn approach provides a sharp approximation to the Wasserstein distance, the same is not guaranteed for its regularized version. In this work we recall both theoretically and empirically that in optimization problems the original Sinkhorn approximation is significantly more favorable than its regularized counterpart, which has indeed been noticed to have a tendency to find over-smooth solutions [14]. We take this as a motivation to study the differential properties of the sharp Sinkhorn with the goal of deriving a strategy to address optimization and learning problems over probability distributions.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThe principal contributions of this work are twofold. Firstly, we show that both Sinkhorn approximations are highly smooth functions, namely C^\u221e functions in the interior of the simplex. Despite the comparable differential properties, sharp and regularized Sinkhorn approximations show a rather different behaviour when adopted in optimization problems such as the computation of barycenters [8]. As a by-product of the proof of the smoothness, we obtain an explicit formula to efficiently compute the gradient of the sharp Sinkhorn approximation, which proves to be a viable alternative to automatic differentiation [10]. As a second main contribution, we provide a novel sound approach to the challenging problem of learning with Sinkhorn loss, recently considered in [6]. In particular, we leverage the smoothness of the Sinkhorn approximation to study the generalization properti
es of a structured prediction estimator adapted from [15] to this setting, proving consistency and finite sample bounds. We provide preliminary empirical evidence of the effectiveness of the proposed approach.\n\n2 Background: Optimal Transport and Wasserstein Distance\n\nOptimal transport theory investigates how to compare probability measures over a domain X. Given a distance function d: X\u00d7X \u2192 R between points on X (e.g. the Euclidean distance on X = R^d), the goal of optimal transport is to \u201ctranslate\u201d (or lift) it to distances between probability distributions over X. This allows us to equip the space P(X) of probability measures on X with a metric referred to as Wasserstein distance, which, for any \u00b5,\u03bd \u2208 P(X) and p \u2265 1, is defined (see [16]) as\n\nW_p^p(\u00b5,\u03bd) = inf_{\u03c0\u2208\u03a0(\u00b5,\u03bd)} \u222b_{X\u00d7X} d^p(x,y) d\u03c0(x,y), (1)\n\nwhere W_p^p denotes the p-th power of W_p and where \u03a0(\u00b5,\u03bd) is the set of probability measures on the product space X\u00d7X whose marginals coincide with \u00b5 and \u03bd; namely\n\n\u03a0(\u00b5,\u03bd) = {\u03c0 \u2208 P(X\u00d7X) such that P_1#\u03c0 = \u00b5, P_2#\u03c0 = \u03bd}, (2)\n\nwith P_i(x_1,x_2) = x_i the projection operators for i = 1,2 and P_i#\u03c0 the push-forward of \u03c0 [16].\n\nWasserstein distance on discrete measures. In the following we focus on measures with discrete support. In particular, we consider distributions \u00b5,\u03bd \u2208 P(X) that can be written as linear combinations \u00b5 = \u2211_{i=1}^n a_i \u03b4_{x_i} and \u03bd = \u2211_{j=1}^m b_j \u03b4_{y_j} of Dirac\u2019s deltas centred at a finite number n and m of points (x_i)_{i=1}^n and (y_j)_{j=1}^m in X. In order for \u00b5 and \u03bd to be probabilities, the weight vectors a = (a_1,...,a_n)^\u22a4 \u2208 \u2206_n and b = (b_1,...,b_m)^\u22a4 \u2208 \u2206_m must belong respectively to the n- and m-dimensional simplex, defined as\n\n\u2206_n = {p \u2208 R^n_+ | p^\u22a4 1_n = 1}, (3)\n\nwhere R^n_+ is the set of vectors p \u2208 R^n with non-negative entries and 1_n \u2208 R^n denotes the vector of all ones, so that p^\u22a4 1_n = \u2211_{i=1}^n p_i for any p \u2208 R^n. In this setting, the evaluation of the Wasserstein distance corresponds to solving a network flow problem [17] in terms of the weight vectors a and b,\n\nW_p^p(\u00b5,\u03bd) = min_{T\u2208\u03a0(a,b)} \u27e8T,M\u27e9, (4)\n\nwhere M \u2208 R^{n\u00d7m} is the cost matrix with e
ntries M_ij = d(x_i,y_j)^p, \u27e8T,M\u27e9 is the Frobenius product Tr(T^\u22a4 M) and \u03a0(a,b) denotes the transportation polytope\n\n\u03a0(a,b) = {T \u2208 R^{n\u00d7m}_+ : T 1_m = a, T^\u22a4 1_n = b}, (5)\n\nwhich specializes \u03a0(\u00b5,\u03bd) in Eq. (2) to this setting and contains all possible joint probabilities with marginals \u201ccorresponding\u201d to a, b. In the following, with some abuse of notation, we will denote by W_p(a,b) the Wasserstein distance between the two discrete measures \u00b5 and \u03bd with corresponding weight vectors a and b.\n\nAn Efficient Approximation of the Wasserstein Distance. Solving the optimization in Eq. (4) is computationally very expensive [1]. To overcome the issue, the following regularized version of the problem is considered,\n\nS\u0303_\u03bb(a,b) = min_{T\u2208\u03a0(a,b)} \u27e8T,M\u27e9 \u2212 (1/\u03bb) h(T) with h(T) = \u2212\u2211_{i,j=1}^{n,m} T_ij (log T_ij \u2212 1), (6)\n\nwhere \u03bb > 0 is a regularization parameter. Indeed, as observed in [1], the addition of the entropy h makes the problem significantly more amenable to computations. In particular, the optimization in Eq. (6) can be solved efficiently via Sinkhorn\u2019s matrix scaling algorithm [18]. We refer to the function S\u0303_\u03bb as the regularized Sinkhorn. In contrast to the Wasserstein distance, the regularized Sinkhorn is differentiable (actually smooth, as we show in this work in Thm. 2) and hence particularly appealing for practical applications where the goal is to solve a minimization over probability spaces. Indeed, this approximation has been recently used with success in settings related to barycenter estimation [8,9,19], supervised learning [6] and dictionary learning [7].\n\n3 Motivation: a Better Approximation of the Wasserstein Distance\n\nThe computational benefit provided by the regularized Sinkhorn is paid for in terms of the approximation with respect to the Wasserstein distance. Indeed, the entropic term in Eq. (6) perturbs the value of the original functional in Eq. (4) by a term proportional to 1/\u03bb, leading to potentially very different behaviours of the two functions (see Example 1 for an example of this effect in practice). In this sense, a natural candidate for a better approximation is\n\nS_\u03bb(a,b) = \u27e8T_\u03bb,M\u27e9 with T_\u03bb = argmin_{T\u2208\u03a0(a,b)} \u27e8T,M\u27e9 \u2212 (1/\u03bb) h(T), (7)\n\nthat corresponds to eliminating the contribu
tion of the entropic regularizer h(T_\u03bb) from S\u0303_\u03bb after the transport plan T_\u03bb has been obtained. The function S_\u03bb was originally introduced in [1] as the Sinkhorn approximation, although recent literature on the topic has often adopted this name for the regularized version Eq. (6). To avoid confusion, in the following we will refer to S_\u03bb as the sharp Sinkhorn. Note that we will interchangeably use the notations S_\u03bb(a,b) and S_\u03bb(\u00b5,\u03bd). The absence of the term h(T_\u03bb) is reflected in a faster rate at approximating the Wasserstein distance.\n\nProposition 1. Let \u03bb > 0. For any pair of discrete measures \u00b5,\u03bd \u2208 P(X) with respective weights a \u2208 \u2206_n and b \u2208 \u2206_m, we have\n\n|S_\u03bb(\u00b5,\u03bd) \u2212 W(\u00b5,\u03bd)| \u2264 c_1 e^{\u2212\u03bb}, |S\u0303_\u03bb(\u00b5,\u03bd) \u2212 W(\u00b5,\u03bd)| \u2264 c_2 \u03bb^{\u22121}, (8)\n\nwhere c_1, c_2 are constants independent of \u03bb, depending on the support of \u00b5 and \u03bd.\n\nThe proof of the exponential decay in error of S_\u03bb in Eq. (8) (Left) follows from [13] (Prop. 5.1), while the corresponding bound for S\u0303_\u03bb in Eq. (8) (Right) is a direct consequence of [20] (Prop. 2.1). Details are presented in the supplementary material. Prop. 1 suggests that the sharp Sinkhorn provides a more natural approximation of the Wasserstein distance. This intuition is further supported by the following discussion where we compare the behaviour of the two approximations on the problem of finding an optimal transport barycenter of probability distributions.\n\nWasserstein Barycenters. Finding the barycenter of a set of discrete probability measures D = (\u00b5_i)_{i=1}^N is a challenging problem in applied optimal transport settings [8]. The Wasserstein barycenter is defined as\n\n\u00b5^\u2217_W = argmin_\u00b5 B_W(\u00b5,D), B_W(\u00b5,D) = \u2211_{i=1}^N \u03b1_i W(\u00b5,\u00b5_i), (9)\n\nnamely the point \u00b5^\u2217_W minimizing the weighted average distance between all distributions in the set D, with \u03b1_i scalar weights. Finding the Wasserstein barycenter is computationally very expensive and the typical approach is to approximate it with the barycenter \u00b5\u0303^\u2217_\u03bb, obtained by substituting the Wasserstein distance W with the r
egularized Sinkhorn S\u0303_\u03bb in the objective functional of Eq. (9). However, in light of the result in Prop. 1, it is natural to ask whether the corresponding barycenter \u00b5^\u2217_\u03bb of the sharp Sinkhorn S_\u03bb could provide a better estimate of the Wasserstein one. While we defer a thorough empirical comparison of the two barycenters to Sec. 6, here we consider a simple scenario in which the sharp Sinkhorn can be proved to be a significantly better approximation of the Wasserstein distance.\n\n[Figure 1 plot: x-axis Bins (0 to 20), y-axis Mass (0 to 1); panels for \u03bb = 5, 50, 100, 500.]\nFigure 1: Comparison of the sharp (Blue) and regularized (Orange) barycenters of two Dirac\u2019s deltas (Black) centered in 0 and 20 for different values of \u03bb.\n\nExample 1 (Barycenter of two Deltas). We consider the problem of estimating the barycenter of two Dirac\u2019s deltas \u00b5_1 = \u03b4_z, \u00b5_2 = \u03b4_y centered at z = 0 and y = n with z,y \u2208 R and n an even integer. Let X = {x_0,...,x_n} \u2282 R be the set of all integers between 0 and n and M the cost matrix with squared Euclidean distances. Assuming uniform weights \u03b1_1 = \u03b1_2, it is well-known that the Wasserstein barycenter is the delta centered on the Euclidean mean of z and y, \u00b5^\u2217_W = \u03b4_{(z+y)/2}. A direct calculation (see Appendix A) shows instead that the regularized Sinkhorn barycenter \u00b5\u0303^\u2217_\u03bb = \u2211_{i=0}^n a_i \u03b4_{x_i} tends to spread the mass across all x_i \u2208 X, according to the amount of regularization,\n\na_i \u221d e^{\u2212\u03bb((z\u2212x_i)^2 + (y\u2212x_i)^2)}, i = 0,...,n, (10)\n\nbehaving similarly to a (discretized) Gaussian with standard deviation of the same order as the regularization \u03bb^{\u22121}. On the contrary, the sharp Sinkhorn barycenter equals the Wasserstein one, namely \u00b5^\u2217_\u03bb = \u00b5^\u2217_W for every \u03bb > 0. An example of this behaviour is reported in Fig. 1.\n\nMain Challenges of the Sharp Sinkhorn. The example above, together with Prop. 1, provides a strong argument in support of adopting the sharp Sinkhorn over its regularized version. However, while the gradient of the regularized Sinkhorn can be easily computed (see [8] or Sec. 4), an explicit form for the gradient of the sharp Sinkhorn has not been considered. While approaches based on automatic differentiation have recently been adopted successfully [4,11,12], in th
is work we are interested in investigating the analytic properties of the gradient of the sharp Sinkhorn, for which we provide an explicit algorithm in the following.\n\n4 Differential Properties of Sinkhorn Approximations\n\nIn this section we present a proof of the smoothness of the two Sinkhorn approximations introduced above, and the explicit derivation of a formula for the gradient of S_\u03bb. These results will be key to employing the sharp Sinkhorn in practical applications. They are obtained by leveraging the Implicit Function Theorem [21] via a proof technique analogous to that in [12,22,23], which we outline in this section and discuss in detail in the appendix.\n\nTheorem 2. For any \u03bb > 0, the Sinkhorn approximations S\u0303_\u03bb and S_\u03bb: \u2206_n \u00d7 \u2206_n \u2192 R are C^\u221e in the interior of their domain.\n\nThm. 2 guarantees both Sinkhorn approximations to be infinitely differentiable. In Sec. 5 this result will allow us to derive an estimator for supervised learning with Sinkhorn loss and characterize its corresponding statistical properties (i.e. universal consistency and learning rates). The proof of Thm. 2 is instrumental in deriving a formula for the gradient of S_\u03bb. We discuss here its main elements and steps while referring to the supplementary material for the complete proof.\n\nSketch of the proof. The proof of Thm. 2 hinges on the characterization of the (Lagrangian) dual problem of the regularized Sinkhorn in Eq. (6). This can be formulated (see e.g. [1]) as\n\nmax_{\u03b1,\u03b2} L_{a,b}(\u03b1,\u03b2), L_{a,b}(\u03b1,\u03b2) = \u03b1^\u22a4 a + \u03b2^\u22a4 b \u2212 (1/\u03bb) \u2211_{i,j=1}^{n,m} e^{\u2212\u03bb(M_ij \u2212 \u03b1_i \u2212 \u03b2_j)}, (11)\n\nwith dual variables \u03b1 \u2208 R^n and \u03b2 \u2208 R^m. By Sinkhorn\u2019s scaling theorem [18], the optimal primal solution T_\u03bb in Eq. (7) can be obtained from the dual solution (\u03b1^\u2217,\u03b2^\u2217) of Eq. (11) as\n\nT_\u03bb = diag(e^{\u03bb\u03b1^\u2217}) e^{\u2212\u03bbM} diag(e^{\u03bb\u03b2^\u2217}), (12)\n\nwhere for any v \u2208 R^n, the vector e^v \u2208 R^n denotes the element-wise exponentiation of v (analogously for matrices) and diag(v) \u2208 R^{n\u00d7n} is the diagonal matrix with diagonal corresponding to v. Since both Sinkhorn approximations are smooth functions of T_\u03bb, it is sufficient to show that T_\u03bb(a,b) itself is smooth as a function of a and b. Given the characterization of Eq. (12) in terms of the dual solution, this amounts to proving that \u03b1^\u2217(a,b) and \u03b2^\u2217(a,b) are smooth with respect to (a,b), which is shown by leveraging the Implicit Function Theorem [21].\n\nAlgorithm 1 Computation of \u2207_a S_\u03bb(a,b)\nInput: a \u2208 \u2206_n, b \u2208 \u2206_m, cost matrix M \u2208 R^{n\u00d7m}_+, \u03bb > 0.\nT = SINKHORN(a,b,M,\u03bb), T\u0304 = T_{1:n,1:(m\u22121)}\nL = T \u2299 M, L\u0304 = L_{1:n,1:(m\u22121)}\nD_1 = diag(T 1_m), D_2 = diag(T\u0304^\u22a4 1_n)^{\u22121}\nH = D_1 \u2212 T\u0304 D_2 T\u0304^\u22a4, f = \u2212L 1_m + T\u0304 D_2 L\u0304^\u22a4 1_n\ng = H^{\u22121} f\nReturn: g \u2212 1_n (g^\u22a4 1_n)\n\nThe gradient of Sinkhorn approximations. We now discuss how to derive the gradient of Sinkhorn approximations with respect to one of the two variables. In both cases, the dual problem introduced in Eq. (11) plays a fundamental role. In particular, as pointed out in [8], the gradient of the regularized Sinkhorn approximation can be obtained directly from the dual solution as \u2207_a S\u0303_\u03bb(a,b) = \u03b1^\u2217(a,b), for any a \u2208 R^n and b \u2208 R^m. This characterization is possible because of well-known properties of primal and dual optimization problems [17]. The sharp Sinkhorn approximation does not have a formulation in terms of a dual problem and therefore a similar argument does not apply. Nevertheless, we show here that it is still possible to obtain its gradient in closed form in terms of the dual solution.\n\nTheorem 3. Let M \u2208 R^{n\u00d7m} be a cost matrix, a \u2208 \u2206_n, b \u2208 \u2206_m and \u03bb > 0. Let L_{a,b}(\u03b1,\u03b2) be defined as in (11), with argmax in (\u03b1^\u2217,\u03b2^\u2217). Let T_\u03bb be defined as in Eq. (12). Then,\n\n\u2207_a S_\u03bb(a,b) = P_{T\u2206_n}(A L 1_m + B L\u0304^\u22a4 1_n), (13)\n\nwhere L = T_\u03bb \u2299 M \u2208 R^{n\u00d7m} is the entry-wise multiplication between T_\u03bb and M and L\u0304 \u2208 R^{n\u00d7(m\u22121)} corresponds to L with the last column removed. The terms A \u2208 R^{n\u00d7n} and B \u2208 R^{n\u00d7(m\u22121)} are\n\n[A B] = \u2212\u03bb D [\u2207^2_{(\u03b1,\u03b2)} L_{a,b}(\u03b1^\u2217,\u03b2^\u2217)]^{\u22121}, (14)\n\nwith D = [I 0] the matrix concatenating the n\u00d7n identity matrix I and the matrix 0 \u2208 R^{n\u00d7(m\u22121)} with all entries equal to zero. The operator P_{T\u2206_n} denotes the projection onto the tangent plane T\u2206_n = {x \u2208 R^n : \u2211_{i=1}^n x_i = 0} to
the simplex \u2206_n.\n\nThe proof of Thm. 3 can be found in the supplementary material (Sec. C). The result is obtained by first noting that the gradient of S_\u03bb is characterized (via the chain rule) in terms of the gradients \u2207_a \u03b1^\u2217(a,b) and \u2207_a \u03b2^\u2217(a,b) of the dual solutions. The main technical step of the proof is to show that these gradients correspond respectively to the terms A and B defined in Eq. (14). To obtain the gradient of S_\u03bb in practice, it is necessary to compute the Hessian \u2207^2_{(\u03b1,\u03b2)} L_{a,b}(\u03b1^\u2217,\u03b2^\u2217) of the dual functional. A direct calculation shows that this corresponds to the matrix\n\n\u2207^2_{(\u03b1,\u03b2)} L(\u03b1^\u2217,\u03b2^\u2217) = [[diag(a), T\u0304_\u03bb], [T\u0304_\u03bb^\u22a4, diag(b\u0304)]], (15)\n\nwhere T\u0304_\u03bb (equivalently b\u0304) corresponds to T_\u03bb (respectively b) with the last column (element) removed. See the supplementary material (Sec. C) for the details of this derivation.\n\nFigure 2: Nested Ellipses: (Left) Sample input data. (Middle) Regularized (Right) sharp Sinkhorn barycenters.\n\nFrom the discussion above, it follows that the gradient of S_\u03bb can be obtained in closed form in terms of the transport plan T_\u03bb. Alg. 1 reports an efficient approach to perform this operation. The algorithm can be derived by simple algebraic manipulation of Eq. (13), given the characterization of the Hessian in Eq. (15). We refer to the supplementary material for the detailed derivation of the algorithm.\n\nBarycenters with the sharp Sinkhorn. Using Alg. 1 we can now apply the accelerated gradient descent approach proposed in [8] to find barycenters with respect to the sharp Sinkhorn. Fig. 2 reports a qualitative experiment inspired by the one in [8], with the goal of comparing the two Sinkhorn barycenters. We considered 30 images of random nested ellipses on a 50\u00d750 grid. We interpret each image as a distribution with support on pixels. The cost matrix is given by the squared Euclidean distances between pixels. Fig. 2 shows some example images in the dataset and the corresponding barycenters of the two Sinkhorn approximations. While the barycenter \u00b5\u0303^\u2217_\u03bb of S\u0303_\u03bb suffers from a blurry effect, the S_\u03bb barycenter \u00b5^\u2217_\u03bb is very sharp, suggesting a better est
imate of the ideal one.\n\nComputational considerations. Differentiation of the sharp Sinkhorn can also be carried out efficiently via Automatic Differentiation (AD) [4]. Here we comment on the computational complexity of Alg. 1 and empirically compare the computational times of our approach and AD as dimensions and number of iterations grow. Experiments were run on an Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz with 16GB RAM. The implementation of this comparison is available online (footnote 1). By leveraging the Sherman-Woodbury matrix identity, it is possible to show that the total cost of computing the gradient \u2207_a S_\u03bb(a,b) with a \u2208 \u2206_n and b \u2208 \u2206_m via Alg. 1 is O(nm min(n,m)). In particular, assume m \u2264 n. Then, the most expensive operations are: O(nm^2) for matrix multiplication and O(m^3) for inverting an m\u00d7m positive definite matrix. Both operations have been well-studied in the numerics literature and efficient off-the-shelf implementations (BLAS, LAPACK) are available, which exploit the low-level parallel structure of modern architectures (e.g. Cholesky and triangular inversion). Therefore, even if a priori the gradient has algorithmic complexity comparable to that of computing the original Wasserstein distance, it is reasonable to expect it to be more efficient in practice. We compared the gradient obtained with Alg. 1 and Automatic Differentiation (AD) on random histograms with different n (y axis), m (x axis), and reg. \u03bb = 0.02. From left to right, we report the ratio time(AD)/time(Alg. 1) for L = 10, L = 50, L = 100 iterations. The results shown in Fig. 3 are averaged over 10 different runs. Experiments show that there exist regimes in which the gradient computed in closed form is a viable alternative to Automatic Differentiation, depending on the task. In particular, it seems that as the ratio between the supports n and m of the two distributions becomes more unbalanced, Alg. 1 is consistently faster than AD.\n\nAccuracy and approximation errors. We conclude this discussion on computational considerations with a note on the accuracy of the method. A priori, the expression T_\u03bb = diag(e^{\u03bb\u03b1^\u2217}) e^{\u2212\u03bbM} diag(e^{\u03bb\u03b2^\u2217}) which is used to derive Alg. 1 holds \u2018at convergence\u2019, while in practice there is a limited budget (in terms of time and memory) for the com
putation of T_\u03bb, i.e. a limited number of iterations. In [24] a similar issue is addressed. In Fig. 4 we empirically show that plugging an approximation T_\u03bb^L obtained with a fixed number L of iterations into the formula for the gradient makes it possible to reach an error with respect to the \u2018true gradient\u2019 comparable to or slightly better than that of automatic differentiation. Errors are measured as the \u2113_2 norm of the difference between the approximated gradient and the \u2018true gradient\u2019, where the \u2018true gradient\u2019 is obtained via automatic differentiation setting 10^5 as the maximum number of iterations. We show how errors decrease with respect to the number of iterations in a toy example with n = m = 2000 and regularization \u03bb = 0.01, 0.02, 0.05. Examples with n >> m can be found in Appendix C.1.\n\nFootnote 1: https://github.com/GiulsLu/OT-gradients\n\nFigure 3: Ratio of time(AD)/time(Alg. 1) for 10, 50, and 100 iterations of the Sinkhorn algorithm\n\nFigure 4: Accuracy of the Gradient obtained with Alg. 1 or AD with respect to the number of iterations\n\n5 Learning with Sinkhorn Loss Functions\n\nGiven the characterization of smoothness for both Sinkhorn approximations, in this section we focus on a specific application: supervised learning with a Sinkhorn loss function. Indeed, the result of Thm. 2 will allow us to characterize the statistical guarantees of an estimator devised for this problem in terms of its universal consistency and learning rates. Differently from [6], which adopted an empirical risk minimization approach, we address the problem in Eq. (16) from a structured prediction perspective [25] following a recent trend of works addressing the problem within the setting of statistical learning theory [15,26,27,28,29]. This will allow us to study a learning algorithm with strong theoretical guarantees that can be efficiently applied in practice.\n\nProblem Setting. The problem of learning with the regularized Sinkhorn has been recently considered in [6] and can be formulated as follows. Let X be an input space and Y = \u2206_n a set of histograms. The goal is to approximate a minimizer of the expected risk\n\nmin_{f:X\u2192Y} E(f), E(f) = \u222b_{X\u00d7Y} S(f(x),y) d\u03c1(x,y), (16)\n\ngiven a finite number of training points (x_i,y_i)_{i=1}^N independently sampled from the unknown distribution \u03c1 on X\u00d7Y. The loss function S: Y\u00d7Y \u2192 R measures prediction errors and in our setting corresponds to either S_\u03bb or S\u0303_\u03bb.\n\nStructured Prediction Estimator. Given a training set (x_i,y_i)_{i=1}^N, we consider f\u0302: X \u2192 Y the structured prediction estimator proposed in [15], defined such that\n\nf\u0302(x) = argmin_{y\u2208Y} \u2211_{i=1}^N \u03b1_i(x) S(y,y_i) (17)\n\nfor any x \u2208 X. The weights \u03b1_i(x) are learned from the data and can be interpreted as scores suggesting the candidate output distribution y to be close to a specific output distribution y_i observed in training according to the metric S. While different learning strategies can be adopted to learn the \u03b1 scores, we consider the kernel-based approach in [15]. In particular, given a positive definite kernel k: X\u00d7X \u2192 R [30], we have\n\n\u03b1(x) = (\u03b1_1(x),...,\u03b1_N(x))^\u22a4 = (K + \u03b3N I)^{\u22121} K_x, (18)\n\nwhere \u03b3 > 0 is a regularization parameter while K \u2208 R^{N\u00d7N} and K_x \u2208 R^N are respectively the empirical kernel matrix with entries K_ij = k(x_i,x_j) and the evaluation vector with entries (K_x)_i = k(x,x_i), for any i,j = 1,...,N. Approaches based on Nystr\u00f6m [31] or random features [32] can be employed to lower the computational complexity of learning \u03b1 from O(n^3) to O(n\u221an) while maintaining the same theoretical guarantees as presented in the following [33,34].\n\nRemark 1 (Structured Prediction and Differentiability of Sinkhorn). The current work provides both a theoretical and a practical contribution to the problem of learning with Sinkhorn approximations. On the one hand, the smoothness guaranteed by Thm. 2 will allow us to characterize the generalization properties of the estimator (see below). On the other hand, Thm. 3 provides an efficient approach to solve the problem in Eq. (17). Indeed, note that this optimization corresponds to solving a barycenter problem in the form of Eq. (9).\n\nGeneralization Properties of f\u0302. We now characterize the theoretical properties of the estimator introduced in Eq. (17). We start by showing that f\u0302 is universally consistent, namely that it achieves minimum expected risk as the number of training points N increases. To avoid technical issues on the boundary, in the following we will require Y = \u2206_n^\u03b5 for some \u03b5 > 0 to be the set of points p \u2208 \u2206_n with p_i \u2265 \u03b5 for any i = 1,...,n. The main technical step in this con
text is to show that for any smooth loss function on Y, the estimator in Eq. (17) is consistent. In this sense, the characterization of smoothness in Thm. 2 is key to proving the following result, in combination with Thm. 4 in [15]. The proof can be found in the supplementary material.\n\nTheorem 4 (Universal Consistency). Let Y = \u2206_n^\u03b5, \u03bb > 0 and S be either S\u0303_\u03bb or S_\u03bb. Let k be a bounded continuous universal kernel (footnote 2) on X. For any N \u2208 \u2115 and any distribution \u03c1 on X\u00d7Y let f\u0302_N: X \u2192 Y be the estimator in Eq. (17) trained with (x_i,y_i)_{i=1}^N points independently sampled from \u03c1 and \u03b3_N = N^{\u22121/4}. Then\n\nlim_{N\u2192\u221e} E(f\u0302_N) = min_{f:X\u2192Y} E(f) with probability 1.\n\nTo our knowledge, Thm. 4 is the first result characterizing the universal consistency of an estimator minimizing an approximation to the Wasserstein distance.\n\nLearning Rates. Under standard regularity conditions on the problem, our analysis also makes it possible to prove excess risk bounds. Since these conditions are significantly technical we give an informal formulation of the theorem (see Sec. D for the rigorous statement and proof).\n\nTheorem 5 (Excess risk bounds - Informal). Let f\u0302_N: X \u2192 Y be the estimator in Eq. (17) with \u03b3 = N^{\u22121/2}. Under standard regularity conditions on \u03c1 (see supplementary material),\n\nE(f\u0302_N) \u2212 min_{f:X\u2192Y} E(f) = O(N^{\u22121/4})\n\nwith high probability with respect to sampling of training data.\n\nRemark 2. Recently in [36] a Sinkhorn divergence with autocorrelation terms has been proved to be a symmetric positive definite function and hence more suitable as loss function in a learning scenario. The statistical guarantees of Thm. 4 and Thm. 5 still hold true for such loss.\n\nWe conclude this section with a note on previous work. We recall that [6] has provided the first generalization bounds for an estimator minimizing the regularized Sinkhorn loss. In Thm. 5 however we characterize the excess risk bounds of the estimator in Eq. (17). The two approaches and analyses are based on different assumptions on the problem. Therefore, a comparison of the corresponding learning rates is outside the scope of this analysis and is left for future work.\n\n6 Experiments\n\nWe present here experiments comparing the two Sinkhorn approximations empirically. Optimization was performed with the accelerated gradi
ent from [8] for S_\u03bb and Bregman projections [9] for S\u0303_\u03bb.\n\nFootnote 2: This is a standard assumption for universal consistency (see [35]). Example: k(x,x\u2032) = e^{\u2212\u2016x\u2212x\u2032\u2016^2/\u03c3}.\n\nTable 1: Average absolute improvement in terms of the ideal Wasserstein barycenter functional B_W in Eq. (9) for sharp vs regularized Sinkhorn for barycenters of random measures with sparse support.\nSupport | 1% | 2% | 10% | 50%\nImprovement B_W(\u00b5\u0303^\u2217_\u03bb) \u2212 B_W(\u00b5^\u2217_\u03bb) | 14.914 \u00b1 0.076 | 12.482 \u00b1 0.135 | 2.736 \u00b1 0.569 | 0.258 \u00b1 0.012\n\nTable 2: Average reconstruction errors of the Sinkhorn, Hellinger, and KDE estimators on the Google QuickDraw reconstruction problem. Errors measured by a digit classifier with base misclassification reported in last column.\n# Classes | Reconstruction Error (%): S_\u03bb | Reconstruction Error (%): S\u0303_\u03bb | Reconstruction Error (%): Hell [26] | Reconstruction Error (%): KDE [37] | Misclassification rate of the classifier (%)\n2 | 3.7 \u00b1 0.6 | 4.9 \u00b1 0.9 | 8.0 \u00b1 2.4 | 12.0 \u00b1 4.1 | 0.024 \u00b1 0.003\n4 | 22.2 \u00b1 0.9 | 31.8 \u00b1 1.1 | 29.2 \u00b1 0.8 | 40.8 \u00b1 4.2 | 0.076 \u00b1 0.008\n10 | 38.9 \u00b1 0.9 | 44.9 \u00b1 2.5 | 48.3 \u00b1 2.4 | 64.9 \u00b1 1.4 | 0.178 \u00b1 0.012\n\nBarycenters with Sinkhorn Approximations. We compared the quality of Sinkhorn barycenters in terms of their approximation of the (ideal) Wasserstein barycenter. We considered discrete distributions on 100 bins, corresponding to the integers from 1 to 100 and a squared Euclidean cost matrix M. We generated datasets of 10 measures each, where only k = 1, 2, 10, 50 (randomly chosen) consecutive bins are different from zero, with the non-zero entries sampled uniformly between 0 and 1 (and then normalized to sum up to 1). We empirically chose the Sinkhorn regularization parameter \u03bb to be the smallest value such that the output T_\u03bb of the Sinkhorn algorithm would be within 10^{\u22126} from the transport polytope in 1000 iterations. Tab. 1 reports the absolute improvement of the barycenter of the sharp Sinkhorn with respect to the one obtained with the regularized Sinkhorn, averaged over 10 independent dataset generations for each support size k. As can be noticed, the sharp Sinkhorn consistently outperforms its regularized counterpart. The improvement is more evident for measures with sparse support and tends to reduce as the support increases. This is in line with the remar
k in Example 1 and the fact that the regularization term in S\u0303_\u03bb encourages oversmoothed solutions.\n\nLearning with Wasserstein loss. We evaluated the Sinkhorn approximations in an image reconstruction problem similar to the one considered in [37] for structured prediction. Given an image depicting a drawing, the goal is to learn how to reconstruct the lower half of the image (output) given the upper half (input). Similarly to [8] we interpret each (half) image as a histogram with mass corresponding to the gray levels (normalized to sum up to 1). For all experiments, according to [15], we evaluated the performance of the reconstruction in terms of the classification accuracy of an image recognition SVM classifier trained on a separate dataset. To train the structured prediction estimator in Eq. (17) we used a Gaussian kernel with bandwidth \u03c3 and regularization parameter \u03b3 selected by cross-validation.\n\nGoogle QuickDraw. We compared the performance of the two estimators on a challenging dataset. We selected c = 2, 4, 10 classes from the Google QuickDraw dataset [38] which consists of images of size 28\u00d728 pixels. We trained the structured prediction estimators on 1000 images per class and tested on another 1000 images. We repeated these experiments 5 times, each time randomly sampling a different training and test dataset. Tab. 2 reports the reconstruction error (i.e. the classification error of the SVM classifier) over images reconstructed by the Sinkhorn estimators, the structured prediction estimator with Hellinger loss [15] and the Kernel Dependency Estimator (KDE) [37]. The last column reports the base misclassification error of the SVM classifier on the ground truth (i.e. the complete digits), providing a lower bound on the smallest possible reconstruction error. Both Sinkhorn estimators perform significantly better than their competitors (except the Hellinger distance outperforming S\u0303_\u03bb on 4 classes). This is in line with the intuition that optimal transport metrics respect the way the mass is distributed on images [1,8]. Moreover, it is interesting to note that the estimator of the sharp Sinkhorn always provides better reconstructions than its regularized counterpart.\n\n7 Conclusions\n\nIn this paper we investigated the differential properties of Sinkhorn approximat
ions. We proved the high order smoothness of the two functions and derived as a by-product of the proof an explicit algorithm to efficiently compute the gradient of the sharp Sinkhorn. The characterization of smoothness proved to be a key tool to study the statistical properties of the Sinkhorn approximation as loss function. In particular we considered a structured prediction estimator for which we proved universal consistency and excess risk bounds. Future work will focus on further applications and a more extensive comparison with the existing literature.\n\nAcknowledgments\n\nThis work was supported in part by EPSRC Grant N. EP/P009069/1, by the European Research Council (grant SEQUOIA 724063), UK Defence Science and Technology Laboratory (Dstl) and Engineering and Physical Research Council (EPSRC) under grant EP/P009069/1. This is part of the collaboration between US DOD, UK MOD and UK EPSRC under the Multidisciplinary University Research Initiative.\n\nReferences\n\n[1] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2292\u20132300. Curran Associates, Inc., 2013.\n[2] Gabriel Peyr\u00e9, Marco Cuturi, et al. Computational optimal transport. Technical report, 2017.\n[3] Stephan Dempe. Foundations of bilevel programming. Springer Science & Business Media, 2002.\n[4] Aude Genevay, Gabriel Peyr\u00e9, and Marco Cuturi. Learning generative models with sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608\u20131617, 2018.\n[5] Nicolas Courty, R\u00e9mi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In ECML/PKDD 2014, LNCS, pages 1\u201316, Nancy, France, September 2014.\n[6] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, and Tomaso Poggio. Learning with a wasserstein loss. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS\u201915, pages 2053\u20132061, Cambridge, MA, USA, 2015. MIT Press.\n[7] Antoine Rolet, Marco Cuturi, and Gabriel Peyr\u00e9. Fast dictionary learning with a smoothed wasserstein loss. In Arthur Gretton and Christian C. Robert, edito
rs,Proceedingsofthe19thInternationalConferenceonArti\ufb01cialIntelligenceandStatistics,volume51ofProceedingsofMachineLearningResearch,pages630\u2013638,Cadiz,Spain,09\u201311May2016.PMLR.[8]MarcoCuturiandArnaudDoucet.Fastcomputationofwassersteinbarycenters.InEricP.XingandTonyJebara,editors,Proceedingsofthe31stInternationalConferenceonMachineLearning,volume32ofProceedingsofMachineLearningResearch,pages685\u2013693,Bejing,China,22\u201324Jun2014.PMLR.[9]Jean-DavidBenamou,GuillaumeCarlier,MarcoCuturi,LucaNenna,andGabrielPeyr\u00e9.Iterativebregmanprojectionsforregularizedtransportationproblems.SIAMJ.Scienti\ufb01cComputing,37(2),2015.[10]NicolasBonneel,GabrielPeyr\u00e9,andMarcoCuturi.Wassersteinbarycentriccoordinates:histogramregressionusingoptimaltransport.ACMTrans.Graph.,35(4):71\u20131,2016.[11]MorganASchmitz,MatthieuHeitz,NicolasBonneel,FredNgole,DavidCoeurjolly,MarcoCuturi,GabrielPeyr\u00e9,andJean-LucStarck.Wassersteindictionarylearning:Optimaltransport-basedunsupervisednonlineardictionarylearning.SIAMJournalonImagingSciences,11(1):643\u2013678,2018.[12]R\u00e9miFlamary,MarcoCuturi,NicolasCourty,andAlainRakotomamonjy.Wassersteindiscriminantanalysis.MachineLearning,May2018.[13]R.CominettiandJ.SanMart\u00edn.Asymptoticanalysisoftheexponentialpenaltytrajectoryinlinearprogramming.MathematicalProgramming,67(1):169\u2013187,Oct1994.[14]J.Ye,P.Wu,J.Z.Wang,andJ.Li.Fastdiscretedistributionclusteringusingwassersteinbarycenterwithsparsesupport.IEEETransactionsonSignalProcessing,65(9):2317\u20132332,May2017.[15]CarloCiliberto,LorenzoRosasco,andAlessandroRudi.Aconsistentregularizationapproachforstructuredprediction.InAdvancesinNeuralInformationProcessingSystems,pages4412\u20134420.2016.[16]C.Villani.OptimalTransport:OldandNew.GrundlehrendermathematischenWissenschaften.SpringerBerlinHeidelberg,2008.[17]D.BertsimasandJ.Tsitsiklis.IntroductiontoLinearOptimization.AthenaScienti\ufb01c,1997.[18]RichardSinkhornandPaulKnopp.Concerningnonnegativematricesanddoublystochasticmatrices.
 Pacific J. Math., 21(2):343\u2013348, 1967.\n[19] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In NIPS, pages 1961\u20131971, 2017.\n[20] Marco Cuturi and Gabriel Peyr\u00e9. A smoothed dual approach for variational Wasserstein problems. SIAM J. Imaging Sciences, 9(1):320\u2013343, 2016.\n[21] C. H. Edwards. Advanced Calculus of Several Variables. Dover Books on Mathematics. Dover Publications, 2012.\n[22] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8), 2000.\n[23] Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, and Sayan Mukherjee. Choosing multiple parameters for support vector machines. Machine learning, 46(1-3):131\u2013159, 2002.\n[24] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. arXiv preprint arXiv:1602.02355, 2016.\n[25] G H Bakir, T Hofmann, B Sch\u00f6lkopf, A J Smola, B Taskar, and S V N Vishwanathan. Predicting structured data. neural information processing, 2007.\n[26] Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco, and Massimiliano Pontil. Consistent multitask learning with nonlinear output relations. In Advances in Neural Information Processing Systems, 2017.\n[27] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, pages 302\u2013313, 2017.\n[28] Anna Korba, Alexandre Garcia, and Florence d\u2019Alch\u00e9 Buc. A structured prediction approach for label ranking. arXiv preprint arXiv:1807.02374, 2018.\n[29] Alessandro Rudi, Carlo Ciliberto, Gian Maria Marconi, and Lorenzo Rosasco. Manifold structured prediction. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montr\u00e9al, Canada., pages 5615\u20135626, 2018.\n[30] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337\u2013404, 1950.\n[31] Alex J Smola and Bernhard Sch\u00f6lkopf. Sparse greedy matrix approximation for machine learning. 2000.\n[32] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines
. In Advances in neural information processing systems, pages 1177\u20131184, 2008.\n[33] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215\u20133225, 2017.\n[34] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3888\u20133898, 2017.\n[35] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.\n[36] J. Feydy, T. S\u00e9journ\u00e9, F.-X. Vialard, S.-i. Amari, A. Trouv\u00e9, and G. Peyr\u00e9. Interpolating between Optimal Transport and MMD using Sinkhorn Divergences. ArXiv e-prints, October 2018.\n[37] J. Weston, O. Chapelle, A. Elisseeff, B. Sch\u00f6lkopf, and V. Vapnik. Kernel dependency estimation. In Advances in Neural Information Processing Systems 15, pages 873\u2013880, Cambridge, MA, USA, October 2003. Max-Planck-Gesellschaft, MIT Press.\n[38] Inc Google. QuickDraw Dataset. https://github.com/googlecreativelab/quickdraw-dataset.\n[39] T. Kollo and D. von Rosen. Advanced Multivariate Statistics with Matrices. Mathematics and Its Applications. Springer Netherlands, 2006.\n[40] Miroslav Fiedler. Bounds for eigenvalues of doubly stochastic matrices. Linear Algebra and its Applications, 5(3):299\u2013310, 1972.\n[41] H. Brezis. Functional Analysis, Sobolev Spaces and Partial Differential Equations. Universitext. Springer New York, 2010.\n[42] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.\n[43] V. Moretti. Spectral Theory and Quantum Mechanics: With an Introduction to the Algebraic Formulation. UNITEXT. Springer Milan, 2013.\n[44] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331\u2013368, 2007.\n[45] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 2018.", "award": [], "sourceid": 2825, "authors": [{"given_name": 
"Giulia", "family_name": "Luise", "institution": "University College London"}, {"given_name": "Alessandro", "family_name": "Rudi", "institution": "INRIA, Ecole Normale Superieure"}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": "IIT"}, {"given_name": "Carlo", "family_name": "Ciliberto", "institution": "Imperial College London"}]}